Joker'16: Spark 2 (API changes; Structured Streaming; Encoders)
Alexey Zinovyev, Java/Big Data Trainer at EPAM
About
With IT since 2007
With Java since 2009
With Hadoop since 2012
With EPAM since 2015
Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
Sprk Dvlprs! Let’s start!
< SPARK 2.0
Modern Java in 2016
Big Data in 2014
Big Data in 2017
Machine Learning EVERYWHERE
Machine Learning vs Traditional Programming
Something wrong with HADOOP
Hadoop is not SEXY
Whaaaat?
MapReduce Job Writing
MR code
Hadoop Developers Right Now
Iterative Calculations: 10x – 100x speedup
MapReduce vs Spark
SPARK INTRO
Loading
// from a local collection
val localData = Seq(5, 7, 1, 12, 10, 25)
val ourFirstRDD = sc.parallelize(localData)
// from a file
val textFile = sc.textFile("hdfs://...")
Word Count
val textFile = sc.textFile("hdfs://...")
val counts = textFile
.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
SQL
val hive = new HiveContext(spark)
hive.hql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hive.hql("LOAD DATA LOCAL INPATH '…/kv1.txt' INTO TABLE src")
val results = hive.hql("FROM src SELECT key, value").collect()
RDD Factory
DStream
val conf = new SparkConf().setMaster("local[2]")
.setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination()
It’s very easy to understand Graph algorithms
def pregel[A](initialMsg: A,
              maxIterations: Int,
              activeDirection: EdgeDirection)
             (vprog: (VertexID, VD, A) => VD,
              sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
              mergeMsg: (A, A) => A): Graph[VD, ED] = { <Your Pregel Computations> }
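To give a feel for what those three functions look like in practice, here is essentially the single-source shortest-paths sketch from the GraphX programming guide; it assumes an existing graph with Double edge weights (the source vertex id is made up):

import org.apache.spark.graphx._

// assumes an existing `graph: Graph[_, Double]` with Double edge weights
val sourceId: VertexId = 42L
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),      // vprog: keep the best distance seen so far
  triplet =>                                           // sendMsg: propose a shorter path downstream
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty,
  (a, b) => math.min(a, b))                            // mergeMsg: combine competing proposals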
SPARK 2.0 DISCUSSION
Spark Family
Case #0 : How to avoid DStreams with their RDD-like API?
Continuous Applications
Continuous Applications cases
• Updating data that will be served in real time
• Extract, transform and load (ETL)
• Creating a real-time version of an existing batch job
• Online machine learning
Write Batches
Catch Streaming
The main concept of Structured Streaming
You can express your streaming computation the same way you would express a batch computation on static data.
Batch
// Read JSON once from S3
val logsDF = spark.read.json("s3://logs")
// Transform with the DataFrame API and save
logsDF.select("user", "url", "date")
  .write.parquet("s3://out")
Real Time
// Read JSON continuously from S3
// (streaming file sources normally also need an explicit .schema(...))
val logsDF = spark.readStream.json("s3://logs")
// Transform with the DataFrame API and save (file sinks need a checkpoint location)
logsDF.select("user", "url", "date")
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "s3://out/_checkpoints")
  .start("s3://out")
Unbounded Table
WordCount from Socket
val lines = spark.readStream
.format("socket")
.option("host", "localhost")
.option("port", 9999)
.load()
val words = lines.as[String].flatMap(_.split(" "))
val wordCounts = words.groupBy("value").count()
Don’t forget to start Streaming!
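The lines above only define the computation; nothing runs until a sink is attached and the query is started. A minimal way to do that, mirroring the official Structured Streaming word-count example (the console sink and complete output mode are just one possible choice):

val query = wordCounts.writeStream
  .outputMode("complete")   // re-emit the whole counts table on every trigger
  .format("console")        // print to stdout; any supported sink works here
  .start()

query.awaitTermination()    // block until the streaming query stops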
WordCount with Structured Streaming
Structured Streaming provides …
• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• the ability to use the DataFrame/Dataset API for streaming
Structured Streaming provides (in dreams) …
• fast & scalable
• fault-tolerant
• end-to-end exactly-once semantics
• stream processing
• the ability to use the DataFrame/Dataset API for streaming
Let’s UNION streaming and static DataSets
org.apache.spark.sql.AnalysisException: Union between streaming and batch DataFrames/Datasets is not supported;
Go to UnsupportedOperationChecker.scala and check your operation
Case #1 : We should think about optimization in RDD terms
Single-threaded collection
No perf issues, right?
The main concept
more partitions = more parallelism
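In RDD terms that knob is simply the number of partitions; a small sketch of the usual ways to inspect and change it (the path and partition counts below are illustrative):

val logs = sc.textFile("hdfs://.../logs", minPartitions = 8)  // hypothetical path; partition hint at load time
println(logs.getNumPartitions)

val wider    = logs.repartition(64)  // full shuffle, more parallelism
val narrower = wider.coalesce(16)    // shrink the partition count without a full shuffle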
Do it in parallel
I’d like NARROW
map, filter, filter (narrow transformations)
groupByKey, join (wide transformations: they shuffle data)
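A tiny illustration of the difference, on hypothetical inputs rdd: RDD[Int] and pairRdd: RDD[(String, Int)]; the reduceByKey variant is the usual way to soften a wide dependency:

val narrow  = rdd.map(_ * 2).filter(_ > 10)          // narrow: stays inside each partition, no shuffle
val grouped = pairRdd.groupByKey().mapValues(_.sum)  // wide: all values for a key travel over the network
val reduced = pairRdd.reduceByKey(_ + _)             // also wide, but combines locally before shuffling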
Case #2 : DataFrames suggest mixing SQL and Scala functions
History of Spark APIs
RDD / SQL / Expression / Dataset
rdd.filter(_.age > 21)           // RDD
df.filter("age > 21")            // DataFrame, SQL-style
df.filter(df.col("age").gt(21))  // DataFrame, expression style
dataset.filter(_.age > 21)       // Dataset API
Case #2 : DataFrames refer to data attributes by name
DataSet = RDD’s types + DataFrame’s Catalyst
• RDD API
• compile-time type-safety
• off-heap storage mechanism
• performance benefits of the Catalyst query optimizer
• Tungsten
Structured APIs in SPARK
Unified API in Spark 2.0
DataFrame = Dataset[Row]
A DataFrame is now simply an untyped Dataset of Row: the schema is known only at runtime, not to the compiler.
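A tiny sketch of what that equality means in practice (the tuple data and column names here are made up for illustration):

import spark.implicits._
import org.apache.spark.sql.{DataFrame, Dataset}

val df: DataFrame = Seq(("Alice", 42L), ("Bob", 38L)).toDF("name", "footSize")
val ds: Dataset[(String, Long)] = df.as[(String, Long)]   // attach compile-time types
val backToRows: DataFrame = ds.toDF()                     // DataFrame = Dataset[Row] again

ds.filter(_._2 > 38)          // checked by the compiler
df.filter("footSize > 38")    // checked only at runtime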
Define a case class, read JSON, filter by a field
case class User(email: String, footSize: Long, name: String)

// DataFrame -> Dataset[User]
val userDS = spark.read.json("/home/tmp/datasets/users.json").as[User]

userDS.map(_.name).collect()
userDS.filter(_.footSize > 38).collect()

userDS.rdd // if you really want an RDD back
Case #3 : Spark has many contexts
Spark Session
• New entry point in Spark for creating Datasets
• Replaces SQLContext, HiveContext and StreamingContext
• The move from SparkContext to SparkSession signals a move away from RDDs
Spark Session
val sparkSession = SparkSession.builder
.master("local")
.appName("spark session example")
.getOrCreate()
val df = sparkSession.read
.option("header","true")
.csv("src/main/resources/names.csv")
df.show()
No, I want to create my lovely RDDs
Where’s the parallelize() method?
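SparkSession still carries the old context, so nothing is lost; a minimal sketch (spark is the SparkSession built above, the data is made up):

val numbers = spark.sparkContext.parallelize(Seq(5, 7, 1, 12, 10, 25))   // a plain RDD[Int], the classic way

import spark.implicits._
val ds = spark.createDataset(Seq(5, 7, 1, 12, 10, 25))                   // Dataset[Int], the Spark 2 way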
RDD?
userDS.rdd // if you really want an RDD back
Case #4 : Spark uses Java serialization A LOT
Two choices to distribute data across the cluster
• Java serialization
  the default, via ObjectOutputStream
• Kryo serialization
  requires registering classes (does not support all Serializable types)
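Switching to Kryo is essentially a configuration change; a sketch (the app name is arbitrary and User is the case class from the earlier slides):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("kryo-example")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[User]))   // registration keeps full class names out of the payload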
The main problem: overhead of serializing
Each serialized object contains the class structure as well as the values
Don’t forget about GC
Tungsten Compact Encoding
Encoder’s concept
Generate bytecode to interact with off-heap & give access to attributes without ser/deser
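In user code encoders mostly show up implicitly; a small sketch of the usual ways to obtain one (User is the case class from the earlier slides, the sample data is made up):

import org.apache.spark.sql.Encoders
import spark.implicits._   // encoders for case classes, primitives and tuples

val users = Seq(User("a@b.com", 42L, "Alice"))

val ds1 = users.toDS()                                        // encoder resolved implicitly
val ds2 = spark.createDataset(users)(Encoders.product[User])  // the same encoder, passed explicitly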
Encoders
No custom encoders
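There is no public API for writing your own encoder yet; the usual workaround for a class Spark cannot encode is the Kryo (or Java-serialization) fallback, at the cost of an opaque binary column. A sketch, with SomeLegacyClass standing in for an arbitrary non-case class:

import org.apache.spark.sql.{Encoder, Encoders}

class SomeLegacyClass(val id: Int, val payload: String)   // hypothetical non-case class

implicit val legacyEnc: Encoder[SomeLegacyClass] = Encoders.kryo[SomeLegacyClass]  // whole object as bytes
val legacyDS = spark.createDataset(Seq(new SomeLegacyClass(1, "x")))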
Case #5 : Not enough storage levels 
Caching in Spark
• Frequently used RDDs can be kept in memory
• One method and one shortcut: persist() and cache()
• SparkContext keeps track of cached RDDs
• Stored as serialized or deserialized Java objects
Full list of options
• MEMORY_ONLY (default for Spark Core)
• MEMORY_AND_DISK
• MEMORY_ONLY_SER (default for Spark Streaming)
• MEMORY_AND_DISK_SER
• DISK_ONLY
• MEMORY_ONLY_2, MEMORY_AND_DISK_2
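These levels are chosen per RDD via persist(); cache() is just shorthand for MEMORY_ONLY. A minimal sketch on two hypothetical RDDs:

import org.apache.spark.storage.StorageLevel

val hot = firstRdd.cache()                                      // == persist(StorageLevel.MEMORY_ONLY)
val big = secondRdd.persist(StorageLevel.MEMORY_AND_DISK_SER)   // serialized in memory, spills to disk

big.unpersist()   // drop it when no longer needed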
Developer API to make new Storage Levels
What’s the most popular file format in BigData?
Case #6 : External libraries to read CSV
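Before Spark 2.0 the usual route was the separate databricks/spark-csv package pulled in with --packages; a sketch of that older approach (the exact package coordinates are from memory and may differ by Scala version):

// spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
val users = sqlContext.read
  .format("com.databricks.spark.csv")   // external data source
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/datasets/samples/users.csv")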
Easy to read CSV
val data = spark.read.format("csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .load("/datasets/samples/users.csv")

data.cache()
data.createOrReplaceTempView("users")
display(data) // Databricks notebook helper
Case #7 : How to measure Spark performance?
You'd better measure performance!
TPC-DS: 99 queries
http://bit.ly/2dObMsH
How to benchmark Spark
Special Tool from Databricks
Benchmark Tool for SparkSQL
https://github.com/databricks/spark-sql-perf
Spark 2 vs Spark 1.6
Case #8 : What’s faster: SQL or DataSet API?
Job Stages in old Spark
Scheduler Optimizations
Catalyst Optimizer for DataFrames
Unified Logical Plan
DataFrame = Dataset[Row]
Bytecode
DataSet.explain()
== Physical Plan ==
Project [avg(price)#43,carat#45]
+- SortMergeJoin [color#21], [color#47]
   :- Sort [color#21 ASC], false, 0
   :  +- TungstenExchange hashpartitioning(color#21,200), None
   :     +- Project [avg(price)#43,color#21]
   :        +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Final,isDistinct=false)], output=[color#21,avg(price)#43])
   :           +- TungstenExchange hashpartitioning(cut#20,color#21,200), None
   :              +- TungstenAggregate(key=[cut#20,color#21], functions=[(avg(cast(price#25 as bigint)),mode=Partial,isDistinct=false)], output=[cut#20,color#21,sum#58,count#59L])
   :                 +- Scan CsvRelation(-----)
   +- Sort [color#47 ASC], false, 0
      +- TungstenExchange hashpartitioning(color#47,200), None
         +- ConvertToUnsafe
            +- Scan CsvRelation(----)
Case #9 : Why does explain() show so many Tungsten things?
How to use the CPU effectively
• Runtime code generation
• Exploiting cache locality
• Off-heap memory management
Tungsten’s goal
Push performance closer to the limits of modern hardware
Maybe something UNSAFE?
UnsafeRowFormat
• A bit set for tracking null values
• Small values are inlined
• Variable-length values store a relative offset into the variable-length data section
• Rows are always 8-byte word aligned
• Equality comparison and hashing can be performed on raw bytes without additional interpretation
Case #10 : Can I influence Memory Management in Spark?
Case #11 : Should I tune the JVM generations?
Cached Data
During operations
For your needs
For Dark Lord
IN CONCLUSION
We still have no …
• ability to join Structured Streaming with other arbitrary sources
• single unified ML API
• GraphX rethinking and redesign
• custom encoders
• Datasets everywhere
• integration with something important
Roadmap
• Support for other data sources (not only S3 + HDFS)
• Transactional updates
• Dataset as the single DSL for all operations
• GraphFrames + Structured MLlib
• Tungsten: custom encoders
• The RDD-based API is expected to be removed in Spark 3.0
And we can DO IT!
[Architecture diagram: Real-Time and Batch Data Ingestion (Internal, External, Social sources); Real-Time ETL & CEP; Batch ETL & Raw Area with a Scheduler (all raw data backup is stored here; HDFS → CFS as an option); Real-Time and Batch Data-Marts; Relations Graph; Ontology Metadata; Search Index; Time-Series Data (Titan & KairosDB store data in Cassandra); Real-time Dashboarding; Events & Alarms pushed via Email, SNMP etc.]
First Part
Second Part
Contacts
E-mail : Alexey_Zinovyev@epam.com
Twitter : @zaleslaw @BigDataRussia
Facebook: https://www.facebook.com/zaleslaw
vk.com/big_data_russia Big Data Russia
vk.com/java_jvm Java & JVM langs
Any questions?