2. Agenda
• What is Presto?
• Why Presto?
• Scylla + Presto
▸ Connector
▸ Examples
• What’s Next?
3. What is Presto?
• Distributed ANSI SQL Query Engine for Big Data
• Developed by Facebook in 2012
• Data sources: HDFS, S3, Cassandra, MySQL, Kafka, PostgreSQL, Redis, and Scylla
• Open Source
• Java-based
5. Why Presto?
• ANSI SQL
• Extensible - multiple data sources
• Fast (compared to Hive)
• Custom engine designed to support SQL semantics
Source: http://techblog.netflix.com/2014/10/using-presto-in-our-big-data-platform.html
8. Presto Cassandra Connector
• DataStream API:
▪ CQL
• DataLocation API
▪ Thrift: describe_ring and describe_splits_ex verbs (used when the number of partitions > 200)
• Metadata API
▪ CQL: get table layout
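Since the connector exposes Scylla as an ordinary Presto catalog, any Presto client can query it. A minimal sketch in Scala over Presto's JDBC driver (the coordinator address, user, and keyspace are placeholders, and presto-jdbc must be on the classpath):

    import java.sql.DriverManager

    object PrestoScyllaTables {
      def main(args: Array[String]): Unit = {
        // The URL names Presto's catalog (cassandra) and schema (the keyspace);
        // the coordinator host/port and keyspace here are hypothetical.
        val url = "jdbc:presto://presto-coordinator:8080/cassandra/mykeyspace"
        val conn = DriverManager.getConnection(url, "analyst", null)
        val rs = conn.createStatement().executeQuery("SHOW TABLES")
        while (rs.next()) println(rs.getString(1))
        conn.close()
      }
    }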
16. Presto SQL - Aggregate Function Examples
• checksum(x) - an order-insensitive checksum of the given values.
• count - number of input rows.
• count_if(x) - number of TRUE input values.
• max_by(x, y) - value of x associated with the maximum value of y over all input values.
• histogram(x) - a map containing the count of the number of times each input value occurs.
• approx_distinct(x) - approximate number of distinct input values. This function provides an approximation of count(DISTINCT x). Zero is returned if all input values are null.
• corr(y, x) - correlation coefficient of input values.
• stddev_samp(x) - sample standard deviation of all input values.
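A hedged example combining several of these functions in one query, reusing the JDBC connection style from the connector sketch (the events table and its columns are invented for illustration):

    // Assumes `conn` is a java.sql.Connection to Presto, as in the earlier sketch.
    val rs = conn.createStatement().executeQuery(
      """SELECT count(*)                 AS total_rows,
        |       count_if(score > 0.5)    AS high_scores,
        |       approx_distinct(user_id) AS unique_users,
        |       max_by(user_id, score)   AS top_scorer,
        |       stddev_samp(score)       AS score_stddev
        |FROM events""".stripMargin)
    while (rs.next())
      println(s"rows=${rs.getLong("total_rows")}, unique=${rs.getLong("unique_users")}")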
17. Airpal
Airpal is Airbnb's web-based query execution tool built on top of Facebook's PrestoDB.
Source https://blogs.aws.amazon.com/bigdata/post/Tx1BF2DN6KRFI27/Analyze-Data-with-Presto-and-Airpal-on-Amazon-EMR
21. What if I Don’t Know?
Find me at the happy hour tonight; we have beer and wine!
22. Why Spark and Scylla?
• Faster analytics with in-memory execution
• Near-real-time analytics on transactional data
• Support for iterative algorithms
• Resource efficiency for multiple workloads
27. Data Locality - Dedicated Clusters
Review your settings for:
spark.locality.wait.* (process, node, rack)
→ The default is 3s
Network speed and latency also matter here (see the sketch below)
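The locality waits are ordinary Spark properties, so they can also be tuned per application. A minimal sketch; the 10s value is only an illustration, not a recommendation:

    import org.apache.spark.SparkConf

    // Longer waits keep tasks at better locality levels before Spark
    // downgrades them to rack-local or any-host scheduling.
    val conf = new SparkConf()
      .setAppName("scylla-locality")
      .set("spark.locality.wait", "10s")          // base value, default 3s
      .set("spark.locality.wait.process", "10s")  // process-local
      .set("spark.locality.wait.node", "10s")     // node-local
      .set("spark.locality.wait.rack", "10s")     // rack-local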
28. Spark & ScyllaDB, CPU Settings
@Scylla: --smp, --cpuset
@Spark: SPARK_WORKER_CORES
• Divide system cores based on the expected workload (see the sketch below)
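A sketch of one way to express the split on a 32-core box, mirroring the 24/8 division described later in this deck; the Scylla flags and the spark-env.sh line live outside the application, so they appear as comments:

    import org.apache.spark.SparkConf

    // Scylla side (server command line), pinning Scylla to 24 of 32 cores:
    //   scylla --smp 24 --cpuset 0-23
    // Spark side (conf/spark-env.sh on each worker), leaving the other 8:
    //   SPARK_WORKER_CORES=8
    // Application side: cap how many standalone-cluster cores one app takes.
    val conf = new SparkConf().set("spark.cores.max", "8")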
29. Spark & Scylla, Memory Settings (Resilient Distributed Datasets)
@Spark
SPARK_WORKER_MEMORY: memory available per worker node
spark.executor.memory: set when sizing a specific application’s memory consumption
@Scylla
Scylla will just take whatever you give it :)
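A matching memory sketch under the same assumptions (the 64g and 8g figures are illustrative, taken from the environment described on the next slide):

    import org.apache.spark.SparkConf

    // Cluster side (conf/spark-env.sh): total memory a worker may hand out.
    //   SPARK_WORKER_MEMORY=64g
    // Application side: how much each executor of this app actually takes.
    val conf = new SparkConf().set("spark.executor.memory", "8g")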
30. Spark Building and C* Connector
• Environment used in the above examples:
▪ 3x AWS i2.8xlarge for Scylla & Spark (co-located)
- Scylla 1.3, Spark 1.6.2
- Maven 3.3.9
- Spark standalone cluster built on EC2
▪ Each server has 32 cores
- 24 cores → Scylla
- 8 cores → Spark
▪ Each server has 244GB RAM
- 64GB for each Spark worker
- 128GB for Scylla
• Uses the Spark Cassandra Connector v1.6 (example below)
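A minimal read against Scylla with the Spark Cassandra Connector 1.6; the keyspace, table, and host are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    object ScyllaRead {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("scylla-read")
          .set("spark.cassandra.connection.host", "10.0.0.1") // any Scylla node
        val sc = new SparkContext(conf)
        // cassandraTable comes from the connector's implicits; it reads the
        // table token range by token range, preferring local replicas.
        val rows = sc.cassandraTable("mykeyspace", "mytable")
        println(s"row count: ${rows.count()}")
        sc.stop()
      }
    }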
31. • Using Spark Standalone Cluster deployment
• Think Resources - CPU and Memory
• Better modeling, easier deployment, faster analytics
• Data locality can be a blessing if managed correctly
▪ Scylla’s optimal sharding enables data ingestion without compromising analytics performance
32. What’s Next?
• Try it yourself
▪ Here is how to do it: http://www.scylladb.com/kb/scylla-and-spark-integration/
• Performance testing and use cases
Tell us about your experience with Scylla and Spark!