2. Agenda
• What is Presto?
• Why Presto?
• Scylla + Presto
▸ Connector
▸ Examples
• What’s Next?
3. What is Presto?
• Distributed ANSI SQL Query Engine for Big Data
• Developed by Facebook in 2012
• Data sources: HDFS, S3, Cassandra, MySQL, Kafka, PostgreSQL, Redis, and Scylla
• Open Source
• Java-based
5. Why Presto?
• ANSI SQL
• Extensible - multiple data sources
• Fast (compared to Hive)
• Custom engine designed to support SQL semantics
Source: http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
8. Presto Cassandra Connector
• DataStream API:
▪ CQL
• DataLocation API
▪ Thrift: describe_ring and describe_splits_ex verbs (used when the number of partitions > 200)
• Metadata API
▪ CQL: get table layout
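Since the connector exposes Scylla as an ordinary Presto catalog, any Presto client can query it. A minimal sketch in Scala over Presto's JDBC driver (the coordinator address, user, and keyspace are placeholders, and presto-jdbc must be on the classpath):

    import java.sql.DriverManager

    object PrestoScyllaTables {
      def main(args: Array[String]): Unit = {
        // The URL names Presto's catalog (cassandra) and schema (the keyspace);
        // the coordinator host/port and keyspace here are hypothetical.
        val url = "jdbc:presto://presto-coordinator:8080/cassandra/mykeyspace"
        val conn = DriverManager.getConnection(url, "analyst", null)
        val rs = conn.createStatement().executeQuery("SHOW TABLES")
        while (rs.next()) println(rs.getString(1))
        conn.close()
      }
    }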
16. Presto SQL - Aggregate Function Examples
• checksum(x) - an order-insensitive checksum of the given values.
• count - number of input rows.
• count_if(x) - number of TRUE input values.
• max_by(x, y) - value of x associated with the maximum value of y over all input values.
• histogram(x) - a map containing the count of the number of times each input value occurs.
• approx_distinct(x) - approximate number of distinct input values. This function provides an approximation of count(DISTINCT x). Zero is returned if all input values are null.
• corr(y, x) - correlation coefficient of input values.
• stddev_samp(x) - sample standard deviation of all input values.
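A hedged example combining several of these functions in one query, reusing the JDBC connection style from the connector sketch (the events table and its columns are invented for illustration):

    // Assumes `conn` is a java.sql.Connection to Presto, as in the earlier sketch.
    val rs = conn.createStatement().executeQuery(
      """SELECT count(*)                 AS total_rows,
        |       count_if(score > 0.5)    AS high_scores,
        |       approx_distinct(user_id) AS unique_users,
        |       max_by(user_id, score)   AS top_scorer,
        |       stddev_samp(score)       AS score_stddev
        |FROM events""".stripMargin)
    while (rs.next())
      println(s"rows=${rs.getLong("total_rows")}, unique=${rs.getLong("unique_users")}")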
17. Airpal
Airpal is Airbnb's web-based query execution tool built on top of Facebook's PrestoDB.
Source https://blogs.aws.amazon.com/bigdata/post/Tx1BF2DN6KRFI27/Analyze-Data-with-Presto-and-Airpal-on-Amazon-EMR
21. What if I Don’t Know?
Find me at the happy hour tonight; we have beer and wine!
22. Why Spark and Scylla?
• Faster analytics with in-memory execution
• Near-real-time analytics on transactional data
• Support for iterative algorithms
• Resource efficiency for multiple workloads
27. Data Locality - Dedicated Clusters
Review your settings for:
spark.locality.wait.* (process, node, rack)
→ The default is 3s
Network speed and latency also matter here (see the sketch below)
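The locality waits are ordinary Spark properties, so they can also be tuned per application. A minimal sketch; the 10s value is only an illustration, not a recommendation:

    import org.apache.spark.SparkConf

    // Longer waits keep tasks at better locality levels before Spark
    // downgrades them to rack-local or any-host scheduling.
    val conf = new SparkConf()
      .setAppName("scylla-locality")
      .set("spark.locality.wait", "10s")          // base value, default 3s
      .set("spark.locality.wait.process", "10s")  // process-local
      .set("spark.locality.wait.node", "10s")     // node-local
      .set("spark.locality.wait.rack", "10s")     // rack-local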
28. Spark & ScyllaDB, CPU Settings
@Scylla: --smp, --cpuset
@Spark: SPARK_WORKER_CORES
• Divide system cores based on the expected workload (see the sketch below)
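A sketch of one way to express the split on a 32-core box, mirroring the 24/8 division described later in this deck; the Scylla flags and the spark-env.sh line live outside the application, so they appear as comments:

    import org.apache.spark.SparkConf

    // Scylla side (server command line), pinning Scylla to 24 of 32 cores:
    //   scylla --smp 24 --cpuset 0-23
    // Spark side (conf/spark-env.sh on each worker), leaving the other 8:
    //   SPARK_WORKER_CORES=8
    // Application side: cap how many standalone-cluster cores one app takes.
    val conf = new SparkConf().set("spark.cores.max", "8")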
29. Spark & Scylla, Memory Settings (Resilient Distributed Datasets)
@Spark
SPARK_WORKER_MEMORY: memory available per worker node
spark.executor.memory: set when sizing a specific application’s memory consumption
@Scylla
Scylla will just take whatever you give it :)
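A matching memory sketch under the same assumptions (the 64g and 8g figures are illustrative, taken from the environment described on the next slide):

    import org.apache.spark.SparkConf

    // Cluster side (conf/spark-env.sh): total memory a worker may hand out.
    //   SPARK_WORKER_MEMORY=64g
    // Application side: how much each executor of this app actually takes.
    val conf = new SparkConf().set("spark.executor.memory", "8g")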
30. Spark Building and C* Connector
• Environment used in the above examples:
▪ 3x AWS i2.8xlarge for Scylla & Spark (co-located)
- Scylla 1.3, Spark 1.6.2
- Maven 3.3.9
- Spark standalone cluster built on EC2
▪ Each server has 32 cores
- 24 cores → Scylla
- 8 cores → Spark
▪ Each server has 244GB RAM
- 64GB for each Spark worker
- 128GB for Scylla
• Uses the Spark Cassandra Connector v1.6 (example below)
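A minimal read against Scylla with the Spark Cassandra Connector 1.6; the keyspace, table, and host are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object ScyllaRead {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("scylla-read")
          .set("spark.cassandra.connection.host", "10.0.0.1") // any Scylla node
        val sc = new SparkContext(conf)
        // cassandraTable comes from the connector's implicits; it reads the
        // table token range by token range, preferring local replicas.
        val rows = sc.cassandraTable("mykeyspace", "mytable")
        println(s"row count: ${rows.count()}")
        sc.stop()
      }
    }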
31. • Using Spark Standalone Cluster deployment
• Think Resources - CPU and Memory
• Better modeling, easier deployment, faster analytics
• Data locality can be a blessing if managed correctly
▪ Scylla’s optimal sharding enables data ingestion without compromising analytics performance
32. What’s Next?
• Try it yourself
▪ Here is how to do it: http://www.scylladb.com/kb/scylla-and-spark-integration/
• Performance testing and use cases
Tell us about your experience with Scylla and Spark!