Spark and Cassandra with the Datastax Spark Cassandra Connector
How it works and how to use it!
Missed Spark Summit but Still want to see some slides?
This slide deck is for you!
Beyond the Query – Bringing Complex Access Patterns to NoSQL with DataStax - ... (StampedeCon)
Learn how to model beyond traditional direct access in Apache Cassandra, utilizing the DataStax platform to harness the power of Spark and Solr to perform search, analytics, and complex operations in place on your Cassandra data.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
Time series with Apache Cassandra - Long version (Patrick McFadin)
Apache Cassandra has proven to be one of the best solutions for storing and retrieving time series data. This talk will give you an overview of the many ways you can be successful. We will discuss how the storage model of Cassandra is well suited for this pattern and go over examples of how best to build data models.
Analyzing Time-Series Data with Apache Spark and Cassandra - StampedeCon 2016 (StampedeCon)
Have you ever wanted to analyze sensor data that arrives every second from across the world? Or maybe you want to analyze intra-day trading prices of millions of financial instruments? Or take all the page views from Wikipedia and compare the hourly statistics? To do this or any other similar analysis, you will need to analyze large sequences of measurements over time. And what better way to do this than with Apache Spark? In this session we will dig into how to consume data and analyze it with Spark, and then store the results in Apache Cassandra.
Frustration-Reduced Spark: DataFrames and the Spark Time-Series Library (Ilya Ganelin)
In this talk I discuss my recent experience working with Spark DataFrames and the Spark Time-Series library. For data frames, the focus will be on usability. Specifically, a lot of the documentation does not cover common use cases like the intricacies of creating data frames, adding or manipulating individual columns, and doing quick and dirty analytics. For the time series library, I dive into the kinds of use cases it supports and why it's actually super useful.
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher)
This presentation describes how you can use Spark as an ETL tool to get data from a relational database into Cassandra. I go through the concept in general and then talk about some specific issues you might run into and how to fix them.
Maximum Overdrive: Tuning the Spark Cassandra Connector (Russell Spitzer, DataStax)
Worried that you aren't taking full advantage of your Spark and Cassandra integration? Well worry no more! In this talk we'll take a deep dive into all of the available configuration options and see how they affect Cassandra and Spark performance. Concerned about throughput? Learn to adjust batching parameters and gain a boost in speed. Always running out of memory? We'll take a look at the various causes of OOM errors and how we can circumvent them. Want to take advantage of Cassandra's natural partitioning in Spark? Find out about the recent developments that let you perform shuffle-less joins on Cassandra-partitioned data! Come with your questions and problems and leave with answers and solutions!
About the Speaker
Russell Spitzer Software Engineer, DataStax
Russell Spitzer received a Ph.D. in Bioinformatics before finding his deep passion for distributed software. He found the perfect outlet for this passion at DataStax, where he began on the Automation and Test Engineering team. He recently moved from finding bugs to making bugs as part of the Analytics team, where he works on integration between Cassandra and Spark as well as other tools.
Lessons from Cassandra & Spark (Matthias Niehoff & Stephan Kepser, codecentric AG) (DataStax)
We built an application based on the principles of CQRS and Event Sourcing using Cassandra and Spark. During the project we encountered a number of challenges and problems with Cassandra and the Spark Connector.
In this talk we want to outline a few of those problems and our actions to solve them. While some problems are specific to CQRS and Event Sourcing applications most of them are use case independent.
About the Speakers
Matthias Niehoff IT-Consultant, codecentric AG
Matthias works as an IT consultant at codecentric AG in Germany. His focus is on big data and streaming applications with Apache Cassandra and Apache Spark, though he does not lose track of other tools in the big data space. Matthias shares his experiences at conferences, meetups, and user groups.
Stephan Kepser Senior IT Consultant and Data Architect, codecentric AG
Dr. Stephan Kepser is an expert on cloud computing and big data. He has written several journal articles and blog posts on both fields. His interests range from legal questions to the architecture and design of cloud computing and big data systems to the technical details of NoSQL databases.
How We Used Cassandra/Solr to Build a Real-Time Analytics Platform (DataStax Academy)
This session will discuss how Cassandra/Solr can be used to create a real-time analytics platform – jKool.
jKool provides in-memory analysis of time-series data, automatically performing sequencing, correlation, grouping, enrichment, synchronization, computation, querying, and display of data streams. The session will discuss the architecture, challenges, and approaches taken to create a real-time analytics platform on top of open source big data analytics platforms: Cassandra, Solr, Kafka & Spark.
Whether running load tests or migrating historic data, loading data directly into Cassandra can be very useful to bypass the system’s write path.
In this webinar, we will look at how data is stored on disk in sstables, how to generate these structures directly, and how to load this data rapidly into your cluster using sstableloader. We'll also review different use cases for when you should and shouldn't use this method.
Opaque: A Data Analytics Platform with Strong Security: Spark Summit East talk (Spark Summit)
As enterprises move to cloud-based analytics, the risk of cloud security breaches poses a serious threat. Encrypting data at rest and in transit is a major first step. However, data must still be decrypted in memory for processing, exposing it to an attacker who has compromised the operating system or hypervisor. Trusted hardware such as Intel SGX has recently become available in latest-generation processors. Such hardware enables arbitrary computation on encrypted data while shielding it from a malicious OS or hypervisor. However, it still suffers from a significant side channel: access pattern leakage.
We present Opaque, a package for Apache Spark SQL that enables very strong security for SQL queries: data encryption, computation verification, and access pattern leakage protection (a.k.a. obliviousness). Opaque achieves these guarantees by introducing new oblivious distributed relational operators that provide 2000x performance gain over state of the art oblivious systems, as well as novel query planning techniques for these operators implemented using Catalyst.
Using Spark to Load Oracle Data into Cassandra (Jim Hatcher, IHS Markit) | C*... (DataStax)
Spark is an execution framework designed to operate on distributed systems like Cassandra. It's a handy tool for many things, including ETL (extract, transform, and load) jobs. In this session, let me share with you some tips and tricks that I have learned through experience. I'm no oracle, but I can guarantee these tips will get you well down the path of pulling your relational data into Cassandra.
About the Speaker
Jim Hatcher Principal Architect, IHS Markit
Jim Hatcher is a software architect with a passion for data. He has spent most of his 20-year career working with relational databases, but he has been working with Big Data technologies such as Cassandra, Solr, and Spark for the last several years. He has supported systems with very large databases at companies like First Data, CyberSource, and Western Union. He is currently working at IHS, supporting an Electronic Parts Database which tracks half a billion electronic parts using Cassandra.
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...) (DataStax)
Many companies use both Elasticsearch and Cassandra, typically for logs or time series, but managing multiple software systems at large scale can be quite challenging. Elassandra tightly integrates Elasticsearch within Cassandra as a secondary index, allowing near-real-time search with all existing Elasticsearch APIs, plugins and tools like Kibana. We will present the core concepts of Elassandra and explain how it draws on internal Cassandra features to make Elasticsearch masterless, scalable with automatic resharding, and more reliable and efficient than deploying both systems separately. We will also explore the bidirectional mapping: the way Elasticsearch automatically creates the corresponding Cassandra schema, and the way Elasticsearch indexes an existing Cassandra table. Furthermore, we will share some use cases and benchmark results demonstrating practical use of Elassandra to scale out, re-index with zero downtime, and search and visualize data with various tools.
About the Speakers
Remi Trouville Consultant, Independant
Remi is an IT engineer who has worked for the last 8 years in the financial industry as a team manager responsible for all the call-center software managing the customer experience. At the end of this period, his team was dealing with 10,000+ agents across 100+ sites and some highly critical business processes, such as the storage of oral proof of sales for transactions. He holds a Master's degree in telecommunications engineering and is now pursuing an executive MBA at a French business school.
Although CQL and SQL have a similar syntax, there are many differences that may confuse users. I will explain what the CQL WHERE clause supports and how Cassandra processes it internally.
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East talk (Spark Summit)
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.
In this talk, we present Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluated Clipper on four common machine learning benchmark datasets and demonstrated its ability to meet the latency, accuracy, and throughput demands of online serving applications. We also compared Clipper to the TensorFlow Serving system and demonstrated comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.
Slides for Data Syndrome's one-hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib, and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
(BDT309) Data Science & Best Practices for Apache Spark on Amazon EMR (Amazon Web Services)
Organizations need to perform increasingly complex analysis on their data — streaming analytics, ad-hoc querying and predictive analytics — in order to get better customer insights and actionable business intelligence. However, the growing data volume, speed, and complexity of diverse data formats make current tools inadequate or difficult to use. Apache Spark has recently emerged as the framework of choice to address these challenges. Spark is a general-purpose processing framework that follows a DAG model and also provides high-level APIs, making it more flexible and easier to use than MapReduce. Thanks to its use of in-memory datasets (RDDs), embedded libraries, fault-tolerance, and support for a variety of programming languages, Apache Spark enables developers to implement and scale far more complex big data use cases, including real-time data processing, interactive querying, graph computations and predictive analytics. In this session, we present a technical deep dive on Spark running on Amazon EMR. You learn why Spark is great for ad-hoc interactive analysis and real-time stream processing, how to deploy and tune scalable clusters running Spark on Amazon EMR, how to use EMRFS with Spark to query data directly in Amazon S3, and best practices and patterns for Spark on Amazon EMR.
This talk will address new architectures emerging for large-scale streaming analytics, some based on Spark, Mesos, Akka, Cassandra, and Kafka (SMACK), and other newer streaming analytics platforms and frameworks using Apache Flink or GearPump. Popular architectures like Lambda separate layers of computation and delivery and require many technologies with overlapping functionality. Some of this results in duplicated code, untyped processes, or high operational overhead, let alone the cost (e.g. ETL).
I will discuss the problem domain and what is needed in terms of strategies, architecture and application design and code to begin leveraging simpler data flows. We will cover how the particular set of technologies addresses common requirements and how collaboratively they work together to enrich and reinforce each other.
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
Learning Objectives:
• Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing.
• How to deploy and tune scalable clusters running Spark on Amazon EMR.
• How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3.
• Common architectures to leverage Spark with Amazon DynamoDB, Amazon Redshift, Amazon Kinesis, and more.
Apache Cassandra and Spark: You got the lighter, let's start the fire (Patrick McFadin)
Introduction to analyzing Apache Cassandra data using Apache Spark. This includes data models, operations topics, and the internals of how Spark interfaces with Cassandra.
Big Data Day LA 2015 - Sparking up your Cassandra Cluster - Analytics made Awe... (Data Con LA)
After a brief technical introduction to Apache Cassandra we'll then go into the exciting world of Apache Spark integration, and learn how you can turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!), and is widely seen as the replacement to Hadoop Map Reduce. Apache Spark coupled with Cassandra are perfect allies, Cassandra does the distributed data storage, Spark does the distributed computation.
Apache Cassandra and Spark when combined can give powerful OLTP and OLAP functionality for your data. We’ll walk through the basics of both of these platforms before diving into applications combining the two. Usually joins, changing a partition key, or importing data can be difficult in Cassandra, but we’ll see how do these and other operations in a set of simple Spark Shell one-liners!
Delivering Meaning In Near-Real Time At High Velocity In Massive Scale with A... (Helena Edelson)
Streaming Big Data: Delivering Meaning In Near-Real Time At High Velocity At Massive Scale with Apache Spark, Apache Kafka, Apache Cassandra, Akka and the Spark Cassandra Connector. Why this pairing of technologies and How easy it is to implement. Example application: https://github.com/killrweather/killrweather
London Cassandra Meetup 10/23: Apache Cassandra at British Gas Connected Home... (DataStax Academy)
Speakers
Jim Anning - Head of Data & Analytics, BGCH
Josep Casals - Lead Data Engineer, BGCH
This presentation will be a mix of strategic overview of platform + technical detail as to how this has been achieved.
Jim will cover Connected Homes: what they do and where the data platform fits in.
Josep will cover the more technical aspects.
Breakthrough OLAP performance with Cassandra and Spark (Evan Chan)
Find out about breakthrough architectures for fast OLAP performance querying Cassandra data with Apache Spark, including a new open source project, FiloDB.
5 Ways to Use Spark to Enrich your Cassandra Environment (Jim Hatcher)
Apache Cassandra is a powerful system for supporting large-scale, low-latency data systems, but it has some tradeoffs. Apache Spark can help fill those gaps, and this presentation will show you how.
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa... (Helena Edelson)
O'Reilly webcast with myself and Evan Chan on the new SNACK stack (a play on SMACK) with FiloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB, and Kafka.
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark (DataStax Academy)
Presenter: Evan Chan, Principal Software Engineer at Socrata Inc.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
Data science with Spark on Amazon EMR - Pop-up Loft Tel Aviv (Amazon Web Services)
A Tale of Two APIs: Using Spark Streaming In Production (Lightbend)
Fast Data architectures are the answer to the increasing need for the enterprise to process and analyze continuous streams of data to accelerate decision making and become reactive to the particular characteristics of their market.
Apache Spark is a popular framework for data analytics. Its capabilities include SQL-based analytics, dataflow processing, graph analytics and a rich library of built-in machine learning algorithms. These libraries can be combined to address a wide range of requirements for large-scale data analytics.
To address Fast Data flows, Spark offers two APIs: the mature Spark Streaming and its younger sibling, Structured Streaming. In this talk, we are going to introduce both APIs. Using practical examples, you will get a taste of each one and obtain guidance on how to choose the right one for your application.
7. The Core is the Cassandra Source
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra
/**
 * Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
 * It inserts data to and scans a Cassandra table. If filterPushdown is true, it pushes down
 * some filters to CQL.
 */
[Diagram: a DataFrame backed by the source org.apache.spark.sql.cassandra]
8. The Core is the Cassandra Source
https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra
/**
 * Implements [[BaseRelation]], [[InsertableRelation]] and [[PrunedFilteredScan]].
 * It inserts data to and scans a Cassandra table. If filterPushdown is true, it pushes down
 * some filters to CQL.
 */
[Diagram: DataFrame -> CassandraSourceRelation -> CassandraTableScanRDD, plus its Configuration]
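To make the shape of that source concrete, here is a minimal sketch of Spark's PrunedFilteredScan contract. This is illustrative only, not the connector's actual code; everything except the Spark API types is invented:

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
import org.apache.spark.sql.types.StructType

// Spark calls buildScan with the columns it needs and the filters it would
// like handled; the real CassandraSourceRelation turns the pushable filters
// into a CQL WHERE clause and scans only the required columns.
class SketchRelation(val sqlContext: SQLContext, val schema: StructType)
  extends BaseRelation with PrunedFilteredScan {

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] =
    ??? // left unimplemented in this sketch
}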
9. Configuration Can Be Done on a Per Source Level
clusterName:keyspaceName/propertyName
Example Changing Cluster/Keyspace Level Properties
val conf = new SparkConf()
  .set("ClusterOne/spark.cassandra.input.split.size_in_mb", "32")
  .set("default:test/spark.cassandra.input.split.size_in_mb", "128")
val lastdf = sqlContext
  .read
  .format("org.apache.spark.sql.cassandra")
  .options(Map(
    "table" -> "words",
    "keyspace" -> "test",
    "cluster" -> "ClusterOne"
  ))
  .load()
10-12. [Diagram, built up over several slides: the first setting creates the namespace "ClusterOne" with spark.cassandra.input.split.size_in_mb = 32; the second creates the namespace "default" with keyspace "test" and spark.cassandra.input.split.size_in_mb = 128. The read above, with "cluster" -> "ClusterOne", resolves to the 32 MB setting.]
13. The same read with "cluster" -> "default" and "keyspace" -> "test" resolves to the default:test setting of 128 MB.
14. With "cluster" -> "default" and "keyspace" -> "other", neither scoped setting matches, so the connector default applies.
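Note that for a named cluster such as ClusterOne to be usable as a separate source, its connection settings follow the same prefix convention. A hedged sketch, with made-up host addresses:

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "10.0.0.1")            // the "default" cluster
  .set("ClusterOne/spark.cassandra.connection.host", "10.0.1.1") // the second cluster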
15. Predicate Pushdown Is Automatic!
Select * From cassandraTable where clusteringKey > 100
16-18. [Diagram, built up over several slides: the plan Show -> Filter (clusteringKey > 100) -> DataFromC*, with Catalyst planning the query.]
https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/org/apache/spark/sql/cassandra/PredicatePushDown.scala
19. [Diagram: Catalyst folds the filter into the scan, adding the where clause "clusteringKey > 100" to the CQL query.]
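A quick way to watch this happen is explain(); a sketch reusing the earlier source options (the column name is illustrative):

val df = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "words", "keyspace" -> "test"))
  .load()

// Filters that were pushed to the source are listed on the
// Cassandra scan node in the physical plan.
df.filter("clusteringKey > 100").explain()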
20. What can be pushed down?
1. Only push down non-partition-key column predicates with the =, >, <, >=, or <= operators.
2. Only push down primary key column predicates with = or IN predicates.
3. If there are regular columns in the pushdown predicates, there must be at least one EQ expression on an indexed column and no IN predicates.
4. All partition column predicates must be included in the predicates to be pushed down; only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
5. For clustering column predicates, only the last predicate can be a non-EQ predicate (including IN), and the preceding column predicates must be EQ predicates.
6. If there is only one clustering column predicate, it can be any non-IN predicate. Nothing is pushed down if there is any OR condition or NOT IN condition.
7. Multiple predicates on the same column cannot be pushed down if any of them is an equality or IN predicate.
21. What can be pushed down?
If you could write it in CQL, it will get pushed down.
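For intuition, a few hedged examples against a hypothetical table with PRIMARY KEY ((pk1, pk2), ck1, ck2); the DataFrame df is assumed to be Cassandra-backed:

df.filter("pk1 = 1 AND pk2 = 2 AND ck1 = 5 AND ck2 > 10") // pushed: this is valid CQL
df.filter("pk1 = 1")             // not pushed: the partition key is incomplete
df.filter("ck1 > 5 AND ck2 = 1") // not pushed: non-EQ on a preceding clustering column
df.filter("ck1 = 5 OR ck2 = 1")  // not pushed: OR conditions never push down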
22. What are we Pushing Down To?
CassandraTableScanRDD
All of the underlying code is the same as with sc.cassandraTable, so everything about Reading and Writing applies.
23. [Same slide, adding:] https://academy.datastax.com/
Watch me talk about this in the privacy of your own home!
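For reference, a minimal sketch of that RDD API, reusing the keyspace and table from the earlier examples:

import com.datastax.spark.connector._ // adds cassandraTable to the SparkContext

// Returns a CassandraTableScanRDD[CassandraRow]; the DataFrame source's
// scans bottom out in this same machinery.
val rdd = sc.cassandraTable("test", "words")
rdd.first() // a CassandraRow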
35-46. The Connector Uses Information on the Node to Make Spark Partitions
[Diagram, built up over many slides: Node 1 owns the token ranges 0-50, 120-220, 300-500, and 780-830. With spark.cassandra.input.split_size_in_mb = 1 and a reported density of 100 tokens per MB, the connector groups and splits these ranges into Spark partitions of roughly 100 tokens each; a large range such as 300-500 is split (for example into 300-400 and 400-500), and the result is partitions 1 through 4.]
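In configuration terms this is the input split size; a sketch (the dotted spelling below matches the connector docs of this era, while the diagram abbreviates it with an underscore):

val conf = new SparkConf()
  .set("spark.cassandra.input.split.size_in_mb", "1") // target ~1 MB of table data per Spark partition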
48-64. Data is Retrieved Using the DataStax Java Driver
[Diagram, built up over many slides: with spark.cassandra.input.page.row.size = 50, the task for each Spark partition (here the token ranges 0-50 and 780-830 on Node 1) issues queries such as
SELECT * FROM keyspace.table WHERE token(pk) > 780 and token(pk) <= 830
SELECT * FROM keyspace.table WHERE token(pk) > 0 and token(pk) <= 50
and the driver pages the results back 50 CQL rows at a time until each range is exhausted.]
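As configuration, a sketch (this property name is the one shown on the slides; later connector releases renamed the input paging setting):

val conf = new SparkConf()
  .set("spark.cassandra.input.page.row.size", "50") // CQL rows fetched per driver round trip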
66-67. Spark RDDs Represent a Large Amount of Data Partitioned into Chunks
[Diagram: an RDD's partitions 1 through 9 spread across Node 1 through Node 4.]
68. The Spark Cassandra Connector saveToCassandra method can be called on almost all RDDs
rdd.saveToCassandra("Keyspace", "Table")
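A minimal end-to-end sketch; the keyspace, table, and columns are assumed to already exist:

import com.datastax.spark.connector._ // adds saveToCassandra to RDDs

// Assumes something like: CREATE TABLE test.words (word text PRIMARY KEY, count int)
val rdd = sc.parallelize(Seq(("spark", 10), ("cassandra", 20)))
rdd.saveToCassandra("test", "words", SomeColumns("word", "count"))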
69. A Java Driver connection is made to the local node and a prepared statement is built for the target table
[Diagram: Node 1 with the Java Driver and Spark partition 1.]
70. Batches are built from data in Spark partitions
[Diagram: rows such as (1,1,1), (1,2,1), (2,1,1), (3,8,1), ... stream from Spark partition 1 into the Java Driver.]
71. By default these batches only contain CQL rows which share the same partition key
[Diagram, with the settings used throughout the following slides:]
spark.cassandra.output.batch.grouping.key = partition
spark.cassandra.output.batch.size.rows = 4
spark.cassandra.output.batch.buffer.size = 3
spark.cassandra.output.concurrent.writes = 2
spark.cassandra.output.throughput_mb_per_sec = 5
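Reproduced as configuration, a sketch (the property names mirror the slides; some output property names changed in later connector releases):

val conf = new SparkConf()
  .set("spark.cassandra.output.batch.grouping.key", "partition") // group rows into batches by partition key
  .set("spark.cassandra.output.batch.size.rows", "4")            // rows per batch
  .set("spark.cassandra.output.batch.buffer.size", "3")          // open batches buffered at once
  .set("spark.cassandra.output.concurrent.writes", "2")          // batch requests in flight at once
  .set("spark.cassandra.output.throughput_mb_per_sec", "5")      // throttle on write volume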
73. When an element is not part of an existing batch, a new batch is started
[Diagram: the row (1,2,1) joins the open batch for PK=1 alongside (1,1,1); rows with other partition keys start their own batches.]
80. If more than batch.buffer.size batches are currently being made, the largest batch is executed by the Java Driver
[Diagram: with open batches for PK=1, PK=2, and PK=3 and batch.buffer.size = 3, the largest open batch is sent to Cassandra.]
84-89. If more batches are currently being executed by the Java driver than concurrent.writes, we wait until one of the requests has been completed
[Diagram, built up over several slides: with concurrent.writes = 2, two batches (e.g. PK=2 and PK=3) are in flight at once; each "Write Acknowledged" frees a slot, letting the next batch (PK=5, then PK=8) be executed.]
90-95. The last parameter throughput_mb_per_sec blocks further batches if we have written more than that much in the past second
[Diagram, built up over several slides: writes are acknowledged for the PK=3, PK=5, and PK=8 batches, but the next batch is blocked until the amount written in the past second falls back under 5 MB.]
96. Thanks for Coming and I Hope You Have a Great Time at C* Summit
http://cassandrasummit-datastax.com/agenda/the-spark-cassandra-connector-past-present-and-future/
Also ask these guys really hard questions: Jacek, Piotr, Alex