Scylla Summit 2018: Best Practices for Running Spark with Scylla

Spark and Scylla deployments are a common theme. Executing analytics workloads on transactional data provides insights to the business team. ETL workloads using Spark and Scylla are common too. We cover the different workloads we have seen in practice and how we helped optimize both Spark and Scylla deployments to support a smooth and efficient workflow. The best practices we discuss include correctly sizing the Spark and Scylla nodes, tuning partition sizes, and setting connector concurrency and Spark retry policies. In addition, we cover ways to use Spark and Scylla in migrations from different data models.

Best Practices for Running
Spark with Scylla
Eyal Gutkind - Head of Solutions Architects

Speaker
Eyal Gutkind is head of solution architects at Scylla. Prior to Scylla, Eyal held product management roles at Mirantis and DataStax. Prior to DataStax, Eyal spent 12 years with Mellanox Technologies in various engineering management and product marketing roles. Eyal holds a BSc degree in Electrical and Computer Engineering from Ben Gurion University, Israel, and an MBA from the Fuqua School of Business at Duke University, North Carolina.

Scylla token architecture
source: http://docs.scylladb.com/architecture/ringarchitecture/

Spark and Spark partitions
source: https://spark.apache.org/docs/latest/cluster-overview.html

Spark and Spark partitions
Node 1: RDD1 Partition 1, RDD2 Partition 4
Node 2: RDD1 Partition 4, RDD2 Partition 2
Node 3: RDD1 Partition 2, RDD2 Partition 3
Node 4: RDD1 Partition 3, RDD2 Partition 1

Scylla to Spark, partition considerations
RDD 1 Partition 3
Pkey1 | Col1 | Col2 | Col3
Pkey2 | Col1 | Col2 | Col3
Pkey7342 | Col1 | Col2 | Col3
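
One way to picture why a single Spark partition can hold many Scylla partitions: the connector reads by token range, so each Spark partition covers a slice of the ring and receives every partition key whose token falls in that slice. The sketch below is an illustration only. It assumes the Murmur3 token space (-2^63 to 2^63-1) and equal-width slices, whereas the real connector derives its splits from the cluster's token ranges and size estimates. The keyspace, table, and column names (`ks.events`, `pkey`) are hypothetical.

```python
# Bounds of the Murmur3 partitioner token space.
MIN_TOKEN = -2**63
MAX_TOKEN = 2**63 - 1


def split_ring(num_splits):
    """Cut the token ring into contiguous (start, end] ranges of equal width."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    step = (MAX_TOKEN - MIN_TOKEN + 1) // num_splits
    bounds = [MIN_TOKEN + i * step for i in range(num_splits)] + [MAX_TOKEN]
    return list(zip(bounds[:-1], bounds[1:]))


def token_range_query(start, end):
    """CQL for one slice; ks.events and pkey are hypothetical names."""
    return (
        "SELECT * FROM ks.events "
        f"WHERE token(pkey) > {start} AND token(pkey) <= {end}"
    )


if __name__ == "__main__":
    # Each range would back one Spark partition in this simplified picture.
    for start, end in split_ring(4):
        print(token_range_query(start, end))
```

Fewer, wider slices mean more Scylla partitions (and more data) per Spark partition; more, narrower slices mean more Spark tasks.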

The Cassandra-Spark connector
https://github.com/datastax/spark-cassandra-connector
▪ Provides Spark Context to data stored in Scylla/Cassandra
▪ Batches writes
▪ Reads Scylla/Cassandra partitions into Spark partitions
▪ Manages connections between Scylla and the Spark driver and executors
▪ Utilizes the Cassandra Java driver

When Spark writes to Scylla
output.batch.grouping.buffer.size
output.batch.size.bytes
output.concurrent.writes
output.batch.grouping.key
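
To make the interaction between the grouping key and the batch size cap more concrete, here is a simplified sketch. It is not the connector's actual algorithm: it assumes rows are grouped by partition key and that a group is emitted as a batch once adding another row would exceed a byte limit. It ignores the grouping buffer size and write concurrency, and the function and parameter names are hypothetical.

```python
from collections import defaultdict


def group_into_batches(rows, key_of, size_of, max_batch_bytes):
    """Group rows by key; emit a batch when the next row would exceed the cap."""
    pending = defaultdict(list)       # key -> rows waiting to be sent
    pending_bytes = defaultdict(int)  # key -> total size of the waiting rows
    batches = []
    for row in rows:
        key = key_of(row)
        size = size_of(row)
        if pending[key] and pending_bytes[key] + size > max_batch_bytes:
            # The batch for this key is full: emit it and start a new one.
            batches.append(pending.pop(key))
            pending_bytes.pop(key)
        pending[key].append(row)
        pending_bytes[key] += size
    # Flush whatever is still waiting once the input is exhausted.
    batches.extend(batch for batch in pending.values() if batch)
    return batches


if __name__ == "__main__":
    rows = [("a", 1), ("b", 1), ("a", 2), ("a", 3)]
    batches = group_into_batches(
        rows,
        key_of=lambda r: r[0],   # stand-in for the partition key
        size_of=lambda r: 10,    # pretend every row is 10 bytes
        max_batch_bytes=20,
    )
    print(batches)
```

In this picture, a larger byte cap yields fewer, bigger batches per partition key, while a smaller cap yields more, smaller requests.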

When Spark reads from Scylla
input.split.size_in_mb
Don’t forget data is compressed on disk!
Scylla paging capabilities will have an impact!
input.fetch.size_in_rows
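
A rough back-of-the-envelope sketch of what the split size implies follows. It assumes the number of Spark partitions for a full scan is approximately the estimated table size divided by the split size. It also assumes, as one reading of the compression note above, that if the size estimate reflects compressed data then each split expands by the compression ratio once loaded into Spark. The table size and compression ratio are hypothetical inputs, not measurements.

```python
import math


def estimate_spark_partitions(table_size_mb, split_size_mb):
    """Approximate number of Spark partitions for a full-table scan."""
    if split_size_mb <= 0:
        raise ValueError("split_size_mb must be positive")
    return max(1, math.ceil(table_size_mb / split_size_mb))


def uncompressed_mb_per_split(split_size_mb, compression_ratio):
    """Rough in-memory size of one split when the split size was
    measured on compressed data.

    compression_ratio = uncompressed size / on-disk size (hypothetical).
    """
    return split_size_mb * compression_ratio


if __name__ == "__main__":
    table_size_mb = 10_240  # hypothetical 10 GB table
    for split_mb in (64, 8, 1):
        print(
            split_mb,
            estimate_spark_partitions(table_size_mb, split_mb),
            uncompressed_mb_per_split(split_mb, 2.5),
        )
```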

Fine tuning Spark performance with Scylla
▪ Increase default Spark parallelism (number of cores in the Spark local machine deployment)
▪ Reduce Spark split size (64 -> 1)
▪ connection.connections_per_executor_max (# of cores or more)
▪ output.concurrent.writes default is 5
▪ concurrent.reads default is 512
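
The tuning points above could be collected into a settings map along these lines. This is a sketch under assumptions: the property names use the `spark.cassandra.` prefix and the naming from the slides, which newer connector releases may have renamed, so check the reference for your connector version. The `tasks_per_core` multiplier is a hypothetical knob, and the values mirror the slide rather than a recommendation for any particular cluster.

```python
def tuning_settings(executor_cores, total_cores, tasks_per_core=2):
    """Build an illustrative settings map from the cluster's core counts."""
    return {
        # More tasks than cores keeps every core busy while others wait on I/O.
        "spark.default.parallelism": total_cores * tasks_per_core,
        # Smaller splits (64 -> 1) create more, lighter Spark partitions.
        "spark.cassandra.input.split.size_in_mb": 1,
        # At least one connection per executor core.
        "spark.cassandra.connection.connections_per_executor_max": executor_cores,
        # Defaults quoted on the slide, listed so they are easy to adjust.
        "spark.cassandra.output.concurrent.writes": 5,
        "spark.cassandra.concurrent.reads": 512,
    }


def to_submit_args(settings):
    """Render settings as a flat list of spark-submit --conf arguments."""
    args = []
    for key, value in sorted(settings.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    settings = tuning_settings(executor_cores=8, total_cores=32)
    print(" ".join(to_submit_args(settings)))
```

Keeping the settings in one place makes it easier to change a single value between test runs and compare the results.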

Conclusion
▪ Scylla enables analytics on top of transactional data
▪ Performance tuning is required for certain workloads
▪ Resource management is key to the stability of your deployment

Q&A
Stay in touch
Learn more
eyal@scylladb.com
@gutkinde
scylladb.com/blog
scylladb-users.slack.com

The document discusses software reliability in the era of big data and real-time processing. It describes how distributed systems like MapReduce and Spark improved reliability over expensive HPC clusters. Frameworks use in-memory computing, immutable data partitions, and checkpointing to tolerate failures. Distributed databases must address consensus and the CAP theorem. Real-time streaming requires techniques like windowing and watermarking to handle late data. The presentation concludes with an overview of a demo platform that collects industrial IoT data, performs real-time processing, and displays results.

Spark Powered by Scylla

ScyllaDB

Register to see webinar: http://go.scylladb.com/wbn-spark-scylla-registration.html Spark has become the de-facto analytics tool for data stored in Scylla. In this webinar we will review different workloads using Spark and Scylla, for example Extract, Transform, Load (ETL), creating joins between tables and summaries and reporting. We will also cover data modeling best practices for Scylla-Spark use cases and different deployment scenarios. To conclude, we will share performance tuning settings to utilize both Scylla and Spark at peak performance. Join us to learn... Why using Spark with Scylla is advantageous for analytics workloads How to create reporting using Spark and Scylla Best practices for data modeling and performance tuning for Scylla and Spark

Running Apache Spark on Kubernetes

DoKC

Apache Spark is one of the leading frameworks for analyzing data in realtime with Spark Structured Streaming. This talk will explain how to step by step run Spark structured streaming jobs on Kubernetes and access the Spark UI to see the jobs metrics and progress. This talk was given by Shardul Srivastava for DoK Day North America @ KubeCon 2021. Watch the talk: https://www.youtube.com/watch?v=WVfxVtTucA4&list=PLHgdNuGxrJt2twht-2suDZaV2zlptDJIf&index=13

State Of FPGA: Current & Future - A Panel discussion @ 4th FPGA Camp

FPGA Central

The panelists discussed their views on the current state and future of FPGAs. Mark expressed concerns about soft error rates and power densities in high-end FPGAs. Gordon said the industry is healthy but the race to more logic cells has pushed single architectures. Daniel said the line between FPGAs and ASICs is blurring and more niche and application-specific solutions will emerge. Chris said higher-level design flows will be needed to meet complexity demands. Umar viewed FPGAs as programmable processing engines well-suited for product architecture risk.

Looking Inside the MySQL 8.0 Document Store

Frederic Descamps

The document discusses MySQL 8.0 introducing a document store functionality with JSON documents while retaining the benefits of a traditional relational database like MySQL. It aims to provide developers with a flexible schemaless way of storing and querying data as objects like in NoSQL databases, while also offering features of relational databases like ACID transactions, reliability, and SQL capabilities. The document store is built on top of the existing MySQL server technology and uses the new JSON data type and X DevAPI, allowing documents to be stored and queried either with SQL or object-style APIs.

Aleksejs Nemirovskis - Manage your data using oracle BDA

Andrejs Vorobjovs

Manage Your Data, Using Oracle Big Data Appliance - Tips & Tricksngest, process and manage the data, using Oracle Big Data Appliance (end-to-end BigData solution from Oracle): - Oracle BDA architecture and componets overview - Oracle platform, Cloudera CDH, Clodera Manager and specific Oracle components; - Advantages and additional value of an Oracle BDA; - Challenges, faced inside whole stack (BDA, Cloudera); - Challenges, which came from original Hadoop EcoSystem; - Customer case (anonymized): how to utilize a power of an Oracle BDA, including external Informatica Big Data Management tool.

SM-re-ex1

Subrata Mandal

This document contains the resume of Subrata Mandal. It summarizes his objective to seek a leadership position in the semiconductor industry, and outlines his 20 years of experience in chip design and post-silicon validation at Intel, including managing teams and holding patents. It provides details of his roles leading I/O design and validation on several Intel microprocessor and chipset projects.

BigDL is a distributed deep Learning framework built for Big Data platform using Apache Spark. It combines the benefits of “high performance computing” and “Big Data” architecture, providing native support for deep learning functionalities in Spark, orders of magnitude speedup than out-of-box open source DL frameworks (e.g., Caffe/Torch) wrt single node performance (by leveraging Intel MKL), and the scale-out of deep learning workloads based on the Spark architecture. We’ll also share how our users adopt BigDL for their deep learning applications (such as image recognition, object detection, NLP, etc.), which allows them to use their Big Data (e.g., Apache Hadoop and Spark) platform as the unified data analytics platform for data storage, data processing and mining, feature engineering, traditional (non-deep) machine learning, and deep learning workloads.

Introduction to Spark: Or how I learned to love 'big data' after all.

Peadar Coyle

Pi Day 2022 - from IoT to MySQL HeatWave Database Service

Frederic Descamps

HeatWave is a massively parallel, high performance, in-memory query accelerator for Oracle MySQL Database Service that accelerates MySQL performance by orders of magnitude for analytics and mixed workloads. But how do you collect data from an Internet of Things Environment so you can use HeatWave to process it? In one hour you will see how data collected by a Raspberry PI or other Internet of Things device can be uploaded to the MySQL Database Service and then processed by HeatWave.

hjsklar CV

hjsklar

This document provides a summary of Horace Sklar's technical experience as a project manager, principal engineer, and technical consultant. Over his 35+ year career, he has specialized in digital signal and image processing system design for military and defense applications. Some of his areas of expertise include FPGA and ASIC design, hardware/software integration, requirements analysis, and project management. He has worked on numerous defense contracts involving satellite payloads, airborne systems, and ground-based detection systems.

Distributed Deep Learning At Scale On Apache Spark With BigDL

Yulia Tell

This document provides an agenda and details for a co-hosted meetup between Intel and Databricks on March 23, 2017 about BigDL. The agenda includes opening remarks, two tech talks (one from Intel and one from Databricks), and a mingling session. It also provides WiFi access details and background on Intel's Big Data Technologies group and BigDL. BigDL is an open-source distributed deep learning library for Apache Spark that allows users to run deep learning applications on Spark.

FPGAs – CHRONOLOGICAL DEVELOPMENTS AND CHALLENGES

IAEME Publication

The Field Programmable Gate Array (FPGA) industry is expanding both in market share and in innovation. The tailored FPGA features make them a better choice to include FPGA in an increasing number of applications in the upcoming years. A constant development of FPGA technology has led to minimize the gap of performance levels between FPGA and Application Specific Integrated Circuit (ASIC). Hence, in recent years, FPGA based platforms are proven more attractive than ASICs since their performance is high in addition to the low cost of the development process and short time to market. Therefore, nowadays, FPGA is highly attractive for a huge range of applications in communications, computing, avionics, security, automotive and consumer electronics. Field Programmable Gate Array industry has shown a steady growth with a market prediction value of USD 9 billion by 2023. Currently, the FPGA companies started growing in reserch areas such as Artifitial Intelligence (AI), Internet of Thing (IoT) and LIght Detection and Ranging (LIDAR). The aim of this paper is to review the developments in FPGA.

Addressing the High Cost of Apache Cassandra

ScyllaDB

Is your Cassandra deployment size out of control? * Do you get constant requests to source more nodes to sustain your NoSQL workload? * Do you need to put an external cache in front of your database to ensure performance? * Is managing your Cassandra clusters too time-consuming and expensive -- either from your own staff or the high price you’re paying your DBaaS vendor? In this webinar, we’ll dive into the myths about Cassandra ownership costs and the pitfalls that come with it. We’ll show that using modern design techniques, simplified tuning and a scalable datastore can help you control your Total Cost of Ownership (TCO). Eyal Gutkind, our VP of Solution Engineering, will walk you though: Primary and secondary considerations for evaluating the effectiveness of your data platform The correlation between use cases and deployment costs How you know it's time to migrate and why Cassandra users! This is a must-attend session for you!! We will show you ways to gauge your Cassandra overspend.

Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...

confluent

This document provides an overview of a webinar on driving business transformation with real-time analytics using Apache Kafka and KSQL. The webinar features presentations from Nick Dearden of Confluent, John Thuma of Arcadia Data, and Thomas Clarke of RCG Global Services. It discusses how Kafka and KSQL can be used together to enable real-time data processing and analytics. It also highlights how Arcadia Data provides a BI tool for KSQL that allows for easy drag-and-drop dashboarding on streaming data. RCG then discusses its approach to digital transformation and data architecture services. The webinar concludes with a Q&A section.

11회 Oracle Developer Meetup 발표 자료: Oracle NoSQL (2019.05.18) oracle-nosql pu...

Taewan Kim

Splunk PNW User Group - Seattle - 2023-06-28.pdf

Amanda Richardson

Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...

Spark Summit

Deep learning is a fast growing subset of machine learning. There is an emerging trend to conduct deep learning in the same cluster along with existing data processing pipelines to support feature engineering and traditional machine learning. As the leading framework for Distributed ML, we believe that the addition of deep learning to the super-popular Spark framework is important, because it allows Spark developers to perform a range of data analysis tasks within a single framework that helps avoid the complexity inherent in using multiple frameworks and libraries. As one of the early and top contributors to Apache Spark, Intel is thrilled to share with the community a big deal contribution to open source Spark…”BigDL” -… A distributed deep Learning framework organically built on Big Data (Apache Spark) platform. It combines the benefits of “high performance computing” and “Big Data” architecture for rich deep learning support. With BigDL on Spark, customers can eliminate large volume of unnecessary dataset transfer between separate systems, eliminate separate HW clusters and move towards a CPU cluster, reduce system complexity and the latency for end-to-end learning. Ultimately, customers can achieve better scale, higher resource utilization, ease of use/development, and better TCO. Feature parity with Caffe and Torch, significant performance boost when combined with Intel’s Math Kernel Library (MKL), scale-out, fault tolerance, elasticity and dynamic resource sharing are some of the prominent features of BigDL. BigDL open source project will be launched at the 2017 Spark Summit East and this keynote will help spotlight this new contribution and benefits to the Spark developer community and encourage their wide contribution and collaboration. We will also showcase some real world applications of Big DL from early customers’ adoption.

Enterprise data science - What it takes to build?

Jothi Periasamy

Enterprise data science is not just creating dashboard, reports, ad-hoc query, models and/or algorithms, it’s beyond all - Take a look at our approach to enterprise data sciences, it’ very complex and it’s very difficult to implement as it’s involved integrating data across enterprise business function regardless of data source, format and structure There are many instances where people talk about enterprise data sciences (Oracle 12C, HADOOP, SAP) but “have you seen enterprise data sciences in a real system as a live demo”, in most cases the answers is “no” but now there is an opportunity to review enterprise data sciences with CloneSkills. I would say confidently say that there is no one in the world who integrated “Oracle 12C” and SAP HANA with HADOOP for real-time data integration except CloneSkills technical architect Mr. Karthik

GuidoBonelli

Guido Bonelli

Getting Started with Spark Scala

Knoldus Inc.

Spark + i python

Guillermo Blasco Jiménez

The document discusses combining Spark and IPython for distributed computing. Spark is a distributed computation engine that runs on a cluster, while IPython provides an interactive computing environment. The goal is to connect IPython to a Spark cluster so developers can process large datasets interactively. This allows processing data at scale during development and easily exporting code to production. The document provides an overview of Spark and IPython architectures and demonstrates connecting an IPython kernel to a Spark context to develop Spark scripts interactively.

DGX Sessions You Won't Want to Miss at GTC 2019

NVIDIA

OOW19 - HOL5221

Bobby Curtis

Oracle GoldenGate provides data integration and replication capabilities. The presentation discusses Oracle GoldenGate's microservices architecture which enables faster deployments. It highlights key use cases such as database high availability, OLTP replication, data warehouse loading, and stream analytics. The presentation also outlines Oracle GoldenGate's continued investment in areas like security, performance, and support for Oracle Database 19c.

Workshop - How to benchmark your database

ScyllaDB

Why you need benchmarks Finding the right database solution for your use case can be an arduous journey. The database deployment touches aspects of throughput performance, latency control, high availability and data resilience. You will need to decide on the infrastructure to use: Cloud, on-premise or a hybrid solution. Data models also have an impact on finding the right fit for the use case. Once you establish a requirements set, the next step is to test your use case against the databases of choice. In this workshop, we will discuss the different data points you need to collect in order to get the most realistic testing environment. We will cover: Data model impact on performance and latency Client behavior related to database capabilities Failover and high availability testing Hardware selection and cluster configuration impact We will show 2 benchmarking tools you can use to test and benchmark your clusters to identify the optimal deployment scenario for your use case. Attend this virtual workshop if you are: Looking to minimize the cost of your database deployment Making a database decision based on performance and scale data Planning to emulate your workload on a pre-production system where you can test, fail fast and learn.

FPGA based mini Project.pptx

SatyabratBordoloi2

This power point presentation summarizes an FPGA mini project. It introduces FPGA components and the Mimas Spartan 6 FPGA development board used. It describes the software tools Oracle VirtualBox and Xilinx ISE Design Suite. Experiments included blinking LEDs and designing a smart digital locker controller circuit in Verilog code. In conclusion, the presenters learned FPGA implementation and gained experience with digital circuits and Verilog coding.

Resume ch2015

Crawford Hoss

This document is a professional resume for Crawford L. Hoss, a senior PCB designer. It summarizes his education including graduating from high school and completing a business degree at Pierce College. It details over 30 years of experience in PCB design across various industries including aerospace, RF, and medical devices. It provides an overview of the types of designs and CAD tools he has experience with and lists relevant employment history and projects completed at each position.

Optimizing NoSQL Performance Through Observability

ScyllaDB

ScyllaDB has the potential to deliver impressive performance and scalability. The better you understand how it works, the more you can squeeze out of it. But before you squeeze, make sure you know what to monitor! Watch our experienced Postgres developer work through monitoring and performance strategies that help him understand what mistakes he’s made moving to NoSQL. And learn with him as our database performance expert offers friendly guidance on how to use monitoring and performance tuning to get his sample Rust application on the right track. This webinar focuses on using monitoring and performance tuning to discover and correct mistakes that commonly occur when developers move from SQL to NoSQL. For example: - Common issues getting up and running with the monitoring stack - Using the CQL optimizations dashboard - Common issues causing high latency in a node - Common issues causing replica imbalance - What a healthy system looks like in terms of memory - Key metrics to keep an eye on This isn’t “Death-by-Powerpoint.” We’ll walk through problems encountered while migrating a real application from Postgres to ScyllaDB – and try to fix them live as well.

Event-Driven Architecture Masterclass: Challenges in Stream Processing

ScyllaDB

Similar to Scylla Summit 2018: Best Practices for Running Spark with Scylla

BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...

Spark Summit

Introduction to Spark: Or how I learned to love 'big data' after all.

Peadar Coyle

Pi Day 2022 - from IoT to MySQL HeatWave Database Service

Frederic Descamps

hjsklar CV

hjsklar

Distributed Deep Learning At Scale On Apache Spark With BigDL

Yulia Tell

FPGAs – CHRONOLOGICAL DEVELOPMENTS AND CHALLENGES

IAEME Publication

Addressing the High Cost of Apache Cassandra

ScyllaDB

Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...

confluent

11회 Oracle Developer Meetup 발표 자료: Oracle NoSQL (2019.05.18) oracle-nosql pu...

Taewan Kim

Splunk PNW User Group - Seattle - 2023-06-28.pdf

Amanda Richardson

Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...

Spark Summit

Enterprise data science - What it takes to build?

Jothi Periasamy

GuidoBonelli

Guido Bonelli

Getting Started with Spark Scala

Knoldus Inc.

Spark + i python

Guillermo Blasco Jiménez

DGX Sessions You Won't Want to Miss at GTC 2019

NVIDIA

OOW19 - HOL5221

Bobby Curtis

Workshop - How to benchmark your database

ScyllaDB

FPGA based mini Project.pptx

SatyabratBordoloi2

Resume ch2015

Crawford Hoss

Similar to Scylla Summit 2018: Best Practices for Running Spark with Scylla (20)

BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...

Introduction to Spark: Or how I learned to love 'big data' after all.

Pi Day 2022 - from IoT to MySQL HeatWave Database Service

hjsklar CV

Distributed Deep Learning At Scale On Apache Spark With BigDL

FPGAs – CHRONOLOGICAL DEVELOPMENTS AND CHALLENGES

Addressing the High Cost of Apache Cassandra

Driving Business Transformation with Real-Time Analytics Using Apache Kafka a...

11회 Oracle Developer Meetup 발표 자료: Oracle NoSQL (2019.05.18) oracle-nosql pu...

Splunk PNW User Group - Seattle - 2023-06-28.pdf

Accelerating Machine Learning and Deep Learning At Scale...With Apache Spark:...

Enterprise data science - What it takes to build?

GuidoBonelli

Getting Started with Spark Scala

Spark + i python

DGX Sessions You Won't Want to Miss at GTC 2019

OOW19 - HOL5221

Workshop - How to benchmark your database

FPGA based mini Project.pptx

Resume ch2015

More from ScyllaDB

Optimizing NoSQL Performance Through Observability

ScyllaDB

Event-Driven Architecture Masterclass: Challenges in Stream Processing

ScyllaDB

Event-Driven Architecture Masterclass: Integrating Distributed Data Stores Ac...

ScyllaDB

Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...

ScyllaDB

Developer Data Modeling Mistakes: From Postgres to NoSQL

ScyllaDB

See where an RDBMS-pro’s intuition leads him astray – and learn practical tips for the data modeling transition ScyllaDB has the potential to deliver impressive performance and scalability. The better you understand how it works, the more you can squeeze out of it. However, developers new to high-performance NoSQL intuitively shoot themselves in the foot with respect to things like table design, query design, indexing, and partitioning. Watch where our experienced Postgres developer intuitively falls into traps that hurt performance and scalability. And learn with him as our database performance expert offers friendly guidance on navigating all the unexpected behaviors that tend to trip up RDBMS experts. This webinar focuses on common data modeling and querying mistakes that occur when developers move from SQL to NoSQL. For example: - Understanding query first design principles - Planning for schema evolution - Steering clear of common pitfalls and anti-patterns - Assessing data access patterns This isn’t “Death-by-Powerpoint.” We’ll walk through problems encountered while migrating a real application from Postgres to ScyllaDB – and try to fix them live as well.

What Developers Need to Unlearn for High Performance NoSQL

ScyllaDB

See where an RDBMS-pro’s intuition leads him astray – and learn practical tips for the transition ScyllaDB has the potential to deliver impressive performance and scalability. The better you understand how it works, the more you can squeeze out of it. However, developers new to high-performance NoSQL intuitively shoot themselves in the foot with respect to things like table design, query design, indexing, and partitioning. Watch where our experienced Postgres developer intuitively falls into traps that hurt performance and scalability. And learn with him as our database performance expert offers friendly guidance on navigating all the unexpected behaviors that tend to trip up RDBMS experts. Our first webinar of this series will cover common mistakes with practices such as: - Translating the data model to NoSQL - Optimizing table design - Optimizing query performance - Planning for partitioning This isn’t “Death-by-Powerpoint.” We’ll walk through problems encountered while migrating a real application from Postgres to ScyllaDB – and try to fix them live as well.

Low Latency at Extreme Scale: Proven Practices & Pitfalls

ScyllaDB

Expert tips on how to maximize your database performance at scale Untangle the complexity of achieving database performance at scale. Join this webinar to discover commonly overlooked ways to get predictable low latency, even at extreme scale. Our Solution Architects will walk you through the strategies and pitfalls learned by working on thousands of real-world distributed database projects, many reaching 1M OPS with single-digit MS latencies. In addition to offering clear recommendations, we’ll also explain the process behind how we arrived at them – so you can benefit from the lessons learned by other teams. We’ll cover how to: - Design and deploy a large-scale distributed database cluster - Optimize your clients’ interactions with it - Expand the cluster horizontally and globally - Ensure it survives whatever disasters the world throws at it

Dissecting Real-World Database Performance Dilemmas

ScyllaDB

Tackling your own database performance challenges is serious business. For a change of pace, let’s have some fun learning from other teams’ performance predicaments. Join us for an interactive session where we dissect four specific database performance challenges faced by teams considering or using ScyllaDB. For each dilemma, we'll: - Examine the context and technical requirements - Talk about potential solutions and cover the pros and cons of each - Disclose what approach the team took, and how it worked out About the speaker: Felipe is an IT specialist with years of experience on distributed systems and open-source technologies. He is one of the co-authors of "Database Performance at Scale", an Open Access, freely available publication for individuals interested on improving database performance. At ScyllaDB, he works as a Solution Architect.

Beyond Linear Scaling: A New Path for Performance with ScyllaDB

ScyllaDB

Linear scaling (sometimes near linear scaling) is often mentioned in several benchmarks, articles and product comparisons as proof that a given technology and algorithmic optimizations perform better than another. But is that really what performance is all about, and should you even care? This webinar discusses performance beyond linear scalability, including what typically matters more when running high throughput and low latency workloads at scale. We'll cover how ScyllaDB offers unparalleled performance and share our insights on: - The hidden aspects of linear scaling - When linear scaling matters most and when it’s simply irrelevant - Often overlooked considerations for optimizing and measuring distributed systems performance Watch now to learn from our experience (and lessons learned) in building the fastest NoSQL database in the world.

Dissecting Real-World Database Performance Dilemmas

ScyllaDB

Navigating Complex Database Performance Hurdles Tackling your own database performance challenges is serious business. For a change of pace, let’s have some fun learning from other teams’ performance predicaments. Join us for an interactive session where we dissect 4 specific database performance challenges faced by teams considering or using ScyllaDB. For each dilemma: - The presenters will describe the context and technical requirements - Together, we’ll talk about potential solutions and cover the pros and cons of each - Finally, we’ll disclose what approach the team took, and how it worked out Throughout the event, we’ll have opportunities to win ScyllaDB swag and prizes! Come prepared to engage in lively discussions and gain valuable insight into database performance strategies.

Database Performance at Scale Masterclass: Workload Characteristics by Felipe...

ScyllaDB

Database Performance at Scale Masterclass: Database Internals by Pavel Emelya...

ScyllaDB

Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna

ScyllaDB

Replacing Your Cache with ScyllaDB

ScyllaDB

This document discusses replacing external caching solutions with using the internal caching capabilities of ScyllaDB. It provides examples of companies that improved performance, reduced costs and complexity by moving from Redis or Elasticsearch with an external cache to using ScyllaDB's embedded cache instead. The document also outlines some of the advantages of ScyllaDB's cache like improved latency, coherency with the database and observability compared to external caching layers.

Powering Real-Time Apps with ScyllaDB_ Low Latency & Linear Scalability

ScyllaDB

Scylla Summit 2018: Best Practices for Running Spark with Scylla

1. Best Practices for Running Spark with Scylla. Eyal Gutkind, Head of Solutions Architects

2. Speaker: Eyal Gutkind is head of solution architects at Scylla. Prior to Scylla, Eyal held product management roles at Mirantis and DataStax. Prior to DataStax, Eyal spent 12 years with Mellanox Technologies in various engineering management and product marketing roles. Eyal holds a BSc degree in Electrical and Computer Engineering from Ben Gurion University, Israel, and an MBA from the Fuqua School of Business at Duke University, North Carolina.

3. Analytics

4. Scylla token architecture (source: http://docs.scylladb.com/architecture/ringarchitecture/)

5. Scylla token architecture (source: http://docs.scylladb.com/architecture/ringarchitecture/)

6. Spark and Spark partitions (source: https://spark.apache.org/docs/latest/cluster-overview.html)

7. Spark and Spark partitions
▪ Node 1: RDD1 Partition 1, RDD2 Partition 4
▪ Node 2: RDD1 Partition 4, RDD2 Partition 2
▪ Node 3: RDD1 Partition 2, RDD2 Partition 3
▪ Node 4: RDD1 Partition 3, RDD2 Partition 1

8. Scylla to Spark, partition considerations
▪ RDD 1 Partition 3 holds the Scylla partitions Pkey1, Pkey2, and Pkey7342, each with columns Col1, Col2, Col3

9. The Cassandra-Spark connector (https://github.com/datastax/spark-cassandra-connector)
▪ Provides Spark Context to data stored in Scylla/Cassandra
▪ Batches writes
▪ Reads Scylla/Cassandra partitions into Spark partitions
▪ Manages connections between Scylla and the Spark driver and executors
▪ Uses the Cassandra Java driver

10. When Spark writes to Scylla
▪ output.batch.grouping.buffer.size
▪ output.batch.size.bytes
▪ output.concurrent.writes
▪ output.batch.grouping.key
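The write settings on this slide are connector properties that are normally passed to Spark as configuration. The following is a minimal sketch of how they might be assembled into `spark-submit` arguments. It assumes the `spark.cassandra.` prefix used in the connector's reference documentation, and the values are illustrative only (chosen to mirror commonly documented defaults, which should be verified for the connector version in use). The helper name `to_spark_submit_args` is hypothetical.

```python
# Write-path settings named on the slide. The "spark.cassandra." prefix and the
# values below are assumptions for illustration; verify them against the
# connector reference for your version before relying on them.
WRITE_SETTINGS = {
    "spark.cassandra.output.batch.grouping.buffer.size": 1000,
    "spark.cassandra.output.batch.size.bytes": 1024,
    "spark.cassandra.output.concurrent.writes": 5,
    "spark.cassandra.output.batch.grouping.key": "partition",
}


def to_spark_submit_args(settings):
    """Turn a settings dict into a flat list of --conf key=value arguments."""
    args = []
    for key in sorted(settings):
        args.extend(["--conf", f"{key}={settings[key]}"])
    return args


if __name__ == "__main__":
    # Print the arguments as they could be appended to a spark-submit command.
    print(" ".join(to_spark_submit_args(WRITE_SETTINGS)))
```

Keeping the settings in one dictionary makes it easier to change a single knob (for example the number of concurrent writes) and re-run the same job while watching load on the Scylla side.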

11. When Spark reads from Scylla
▪ input.split.size_in_mb
▪ Don’t forget data is compressed on disk!
▪ Scylla paging capabilities will have an impact!
▪ input.fetch.size_in_rows
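The reminder about on-disk compression can be made concrete with some back-of-envelope arithmetic. The sketch below is not the connector's actual split algorithm; it assumes the split size is applied to the uncompressed data and that `compression_ratio` means compressed size divided by uncompressed size. Both function names are hypothetical.

```python
import math


def estimate_spark_partitions(on_disk_mb, compression_ratio, split_size_mb):
    """Roughly estimate Spark partitions for a full table scan.

    compression_ratio is compressed / uncompressed (e.g. 0.5), an assumption
    of this sketch rather than a value reported by the connector.
    """
    if compression_ratio <= 0 or split_size_mb <= 0:
        raise ValueError("compression_ratio and split_size_mb must be positive")
    uncompressed_mb = on_disk_mb / compression_ratio
    return max(1, math.ceil(uncompressed_mb / split_size_mb))


def pages_per_split(rows_in_split, fetch_size_in_rows):
    """Number of paged round trips needed to read one split."""
    if fetch_size_in_rows <= 0:
        raise ValueError("fetch_size_in_rows must be positive")
    return max(1, math.ceil(rows_in_split / fetch_size_in_rows))


if __name__ == "__main__":
    # 10 GB on disk at a 0.5 ratio is about 20 GB uncompressed.
    print(estimate_spark_partitions(10240, 0.5, 64))  # 320
    print(pages_per_split(100_000, 1000))             # 100
```

Sizing from the on-disk figure alone would halve the estimate in this example, which is the kind of surprise the slide warns about.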

12. To collocate or not to collocate?

13. Fine tuning Spark performance with Scylla
▪ Increase default Spark parallelism (number of cores in the Spark local machine deployment)
▪ Reduce Spark split size (64 -> 1)
▪ connection.connections_per_executor_max (number of cores or more)
▪ output.concurrent.writes (default 5)
▪ concurrent.reads (default 512)
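The tuning bullets above could be collected into a `spark-defaults.conf` style fragment. This is a sketch under assumptions: the `spark.cassandra.` prefix comes from the connector's reference documentation, `spark.default.parallelism` is the standard Spark property, and the `parallelism_factor` multiplier and the function names are hypothetical choices made for illustration rather than values from the talk.

```python
def tuned_settings(executor_cores, total_cores, parallelism_factor=2):
    """Build a settings dict that follows the slide's tuning bullets."""
    if executor_cores < 1 or total_cores < 1 or parallelism_factor < 1:
        raise ValueError("core counts and parallelism_factor must be >= 1")
    return {
        # Raise parallelism above the core count (the factor is an assumption).
        "spark.default.parallelism": total_cores * parallelism_factor,
        # Split size reduced from 64 to 1, as on the slide.
        "spark.cassandra.input.split.size_in_mb": 1,
        # At least one connection per executor core.
        "spark.cassandra.connection.connections_per_executor_max": executor_cores,
        # Defaults quoted on the slide, stated explicitly so they are easy to change.
        "spark.cassandra.output.concurrent.writes": 5,
        "spark.cassandra.concurrent.reads": 512,
    }


def render_spark_defaults(settings):
    """Render settings as whitespace-separated 'key value' lines."""
    return [f"{key} {settings[key]}" for key in sorted(settings)]


if __name__ == "__main__":
    for line in render_spark_defaults(tuned_settings(executor_cores=8, total_cores=32)):
        print(line)
```

Property names have changed between connector releases, so the keys should be checked against the documentation for the version actually deployed.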

14. Conclusion
▪ Scylla enables analytics on top of transactional data
▪ Performance tuning is required for certain workloads
▪ Resource management is key to the stability of your deployment

15. Q&A Stay in touch Learn more eyal@scylladb.com @gutkinde scylladb.com/blog scylladb-users.slack.com
