This case study concerns moving large amounts of patent data from Cassandra to Solr: how we approached the problem, why we introduced Spark as a solution, and how we optimized our Spark jobs. I will cover:
* Understanding the parts of a Spark job: which components run where, and common issues.
* Adding metrics to show where pain points are in your code.
* Comparing various methods in the API to achieve more performant code.
* How we saved time and made a repeatable process with Spark.
2. Christopher Bradford
• DataStax Certified Cassandra Architect
• Contributed to CQLEngine - Python C* ORM
• Developed Trireme - a migration engine for Cassandra & DSE
• Created the world’s smallest C* cluster
Twitter: @bradfordcp
GitHub: bradfordcp
3. OpenSource Connections
• Consulting firm based in Charlottesville Virginia
• Founded in 2005
• Focused on Search in 2010, specifically Solr and Lucene
• Delivering Cassandra Consulting since 2012
• DataStax Gold Partner
• Great with Search, Analytics and Discovery
13. CQL vs SQL: WHERE
type | name | rank
------+----------+-------
last | STOBAUGH | 25067
last | BRUDNER | 65304
last | SKLAR | 12517
last | VRANES | 59290
last | SCHRODT | 34764
SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290
CQL
SELECT * FROM names WHERE rank = 59290;
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "
CREATE TABLE names (
type VARCHAR,
name VARCHAR,
rank INT,
PRIMARY KEY ((type, name))
);
14. CQL vs SQL: WHERE
type | name | rank
------+----------+-------
last | STOBAUGH | 25067
last | BRUDNER | 65304
last | SKLAR | 12517
last | VRANES | 59290
last | SCHRODT | 34764
SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290
CQL
SELECT * FROM names WHERE type = 'last' AND name = 'VRANES';
last | VRANES | 59290
CREATE TABLE names (
type VARCHAR,
name VARCHAR,
rank INT,
PRIMARY KEY ((type, name))
);
15. CQL vs SQL: Tables
names_by_rank
rank | type | name
-------+------+----------
25067 | last | STOBAUGH
65304 | last | BRUDNER
12517 | last | SKLAR
59290 | last | VRANES
34764 | last | SCHRODT
names
type | name | rank
------+----------+-------
last | STOBAUGH | 25067
last | BRUDNER | 65304
last | SKLAR | 12517
last | VRANES | 59290
last | SCHRODT | 34764
SQL
SELECT * FROM names_by_rank WHERE rank = 59290;
59290 | last | VRANES
CQL
SELECT * FROM names_by_rank WHERE rank = 59290;
59290 | last | VRANES
16. CQL vs SQL: Indexes
SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290
CQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290
type | name | rank
------+----------+-------
last | STOBAUGH | 25067
last | BRUDNER | 65304
last | SKLAR | 12517
last | VRANES | 59290
last | SCHRODT | 34764
CREATE INDEX ON names (rank);
17. CQL vs SQL: Recap
• Consider multiple tables with data models that support fast, efficient querying.
• Remember that writes are extremely fast in C*. Writing to multiple tables is not necessarily a bad thing.
• Build an index table: your model may support building an inverted index for lookups of record ids.
• Use secondary indexes***
19. Unbalanced Cluster Symptoms
• Certain nodes shutting down mid-way through ingestion
• Data is not cleanly distributed across the cluster
20. Unbalanced Cluster Causes
• Data Model – check your partitions!
• Configuration – how are your tokens split amongst the nodes?
• Hardware – is the server configured correctly?
22. Balancing: Data Model
CREATE TABLE images (
year INT,
id TEXT,
page TEXT,
image BLOB,
PRIMARY KEY (year, id, page)
);
SELECT * FROM images WHERE year = 2015;
Sample unbalanced model
23. Balancing: Data Model
CREATE TABLE images (
year INT,
month INT,
id TEXT,
page INT,
image BLOB,
PRIMARY KEY ((year, month), id, page)
);
SELECT * FROM images WHERE year = 2015 AND month IN (1,…);
Switch partition key to use multiple fields instead of just year.
44. APIs: Transformations & Actions
• Transformations: lazily executed; the code is not executed until an action is applied.
  • Ex: map
• Actions: operate on RDD elements and return results to the driver.
  • Ex: foreach
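A minimal sketch of the distinction, in Java since our jobs used the Java API (Java 8 lambdas like these need Spark 1.x; the data and app name are made up):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "lazy-demo");
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformation: map is recorded in the RDD lineage, nothing runs yet.
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

        // Action: foreach triggers execution; each element is processed on an executor.
        doubled.foreach(n -> System.out.println(n));

        sc.stop();
    }
}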
We are based in Charlottesville, Virginia.
We’ve always been interested in search (one of our founders wrote the book on it). In 2010 we really made search our focus and have been adding related technologies to help deliver on full-text search.
In 2012 we also started delivering Cassandra consulting, and we are currently a Datastax Gold Partner.
We’re really active bloggers, with a bunch of open source code and projects.
Relevant Search will be out soon; it’s a great book about the art of tuning search results.
Building a Search Server with ElasticSearch is a great video introduction to both the Angular JavaScript framework and ElasticSearch.
Apache Solr Enterprise is the definitive guide for planning, building, and maintaining Apache Solr.
Sections
- Cassandra and common pitfalls when starting to develop against it.
- Spark and the trials encountered while implementing an ETL tool.
EST
225 years of patent data starting in 1790
Patents are currently stored as TIF images with XML documents providing metadata and content (currently around 250 fields per patent)
Multiple collections spanning data from many countries (2 currently implemented with an additional 5 coming online this year)
Supports a custom query syntax which has been used at the Patent Office over the past 30 years
DataStax Enterprise 4.5 and 4.6
Cassandra 2.0
Solr 4.10.2
Spark 0.9.2 and 1.1 (more on that later)
The application is composed of a rich front-end application, an API layer, and a data layer.
For the purposes of this talk we’re going to focus on the data layer.
Spark isn't shown at the moment, but we'll get there.
EST is a system for searching patents and applications from 1790 to today. Here we see the number of applications and grants from 1963 through 2014.
Each patent is composed of a few components. The canonical patent is a set of TIF images. Metadata surrounding the patent is stored in XML files and databases.
This project must ingest the data from these data sources (compressed archives and legacy systems) and make it searchable.
Or “How your queries fail because WHERE does not work the way you think it should”
When getting started with Cassandra, the first issues you encounter involve CQL.
Coming from a relational world, it looks so much like SQL that we are lured into a false sense of comfort.
I remember seeing the SELECT * queries in tutorials and thinking “I’ve got this”. I was wrong.
Instead of getting data back, I was met with error, after error, after error.
Here we have a table which holds some census data about names. The table has three columns: type and name, which together form the partition key, and rank, which is not indexed, merely present.
In SQL you can query on most columns without hassle; depending on the volume of data, queries may be slow, but eventually data will come back.
In CQL columns may only be queried if they are part of the PRIMARY KEY or have a secondary index applied to them. In this case the query is not executed.
To get our query to execute we must filter on ALL the partition key columns (and subsequently clustering columns, but this is optional). In our case rank isn’t indexed at all so we cannot include it in the query.
This leaves us with a query that doesn’t match the sentiment that we had in our SQL query.
In fact they don’t function the same at all. How can we go about finding a name by its rank?
We could construct a new table that has rank as the primary key column.
This allows for the data to be queried in CQL just like our SQL from before.
There are a few drawbacks here, now we are maintaining data in 2 tables, names and names by rank. This isn’t too bad since the size of the data is small. We need to be sure to update both tables when changes are made to the data.
Another concern is with the new table. Care must be taken to ensure that the PRIMARY KEY on the new table behaves the way we expect. In this case our new PRIMARY KEY must be ((rank), type). Without the type value as a clustering column we would have upserts when inserting a first and last name with the same rank.
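In CQL, that second table might look like this (a sketch; names_by_rank is the table from the slide, and the key layout follows the note above):

CREATE TABLE names_by_rank (
  rank INT,
  type VARCHAR,
  name VARCHAR,
  PRIMARY KEY ((rank), type)
);

-- The lookup from the SQL example now works:
SELECT * FROM names_by_rank WHERE rank = 59290;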
Another option is to create a secondary index on the column. Secondary indexes are local indexes, meaning they exist on each node.
If we reference a partition key in the query along with a secondary index the query will only go to the node responsible for that partition before performing the lookup on the local index.
Cassandra handles keeping the index up to date as data changes. If you’re looking for a distributed index consider setting up a index table (like the one in the last slide).
*** Secondary indexes are NOT always the answer. For a while they were not recommended at all. Ensure the field being indexed is low-cardinality, and use only one secondary index per query: a query with two will consult only one index and then filter all matching rows on the second indexed column.
EST used point lookups against C*. Complex queries (including ranges) were run against the Solr cluster.
WHERE can be tricky when starting out with C*. I recommend checking out DataStax’s blog where Benjamin Lerer has an excellent article called “A deep look to the CQL where clause”
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
It really goes in to all the ways you can use WHERE given various data models.
Or “Why did the cluster go down after hours of ingesting data?”
Our next lesson concerns keeping the C* cluster balanced. Cassandra does a lot of work trying to keep the data evenly distributed across all of the nodes in our cluster.
Even with these optimizations and approaches, things can still get a bit out of whack.
Let’s look at the symptoms of an unbalanced cluster.
OpsCenter showed certain nodes were completely full while others had ~10 MB of data on them. What’s happening?!
Data Model: certain partitions in the data naturally have more data
Configuration: The cluster may be configured in such a way that data isn’t evenly distributed
Hardware: Nodes may be experiencing hardware issues, or misconfiguration on the OS level
Let’s apply some context to these numbers:
- Images
- Pages
- Metadata for each
Fix issues within the data model that can lead to unbalanced nodes
Note how the partition key for this record is year. In this case every image of every page of a patent is stored in a single wide row with the year as the row key.
Looking at our graph the width of this row will simply keep getting bigger and bigger.
In this model we have added the month field and placed it as part of the partition key. Now patents are distributed into smaller, more manageable buckets.
It’s worth noting that with this approach all SELECTs will require specifying both the year AND month. This can be done with the IN clause or, as described recently in a DataStax blog post, with separate asynchronous queries.
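A sketch of the separate-asynchronous-queries variant with the DataStax Java driver (2.x era API; the contact point and keyspace name are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class MonthBuckets {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("patents"); // hypothetical keyspace

        // Fire one async query per (year, month) partition instead of one IN query.
        List<ResultSetFuture> futures = new ArrayList<>();
        for (int month = 1; month <= 12; month++) {
            futures.add(session.executeAsync(
                "SELECT * FROM images WHERE year = ? AND month = ?", 2015, month));
        }

        // Drain the futures as they complete.
        for (ResultSetFuture future : futures) {
            for (Row row : future.getUninterruptibly()) {
                System.out.println(row.getString("id"));
            }
        }
        cluster.close();
    }
}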
For more information on how data is stored in Cassandra check out the excellent deep dive on the CQL storage engine by John Berryman on Planet Cassandra.
Virtual Nodes?
By default Cassandra is configured with virtual nodes enabled. This means that each node in the cluster is randomly assigned multiple chunks of the token ring.
In some cases virtual nodes are disabled, such as when running Hadoop or Solr on top of Cassandra.
Look into the cluster configuration. In our case we were using single token nodes. We had gross balancing issues with certain nodes completely filling their disks and others sitting mostly empty.
Why virtual nodes?
You no longer have to calculate and assign tokens to each node.
Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster.
Rebuilding a dead node is faster because it involves every other node in the cluster.
Improves the use of heterogeneous machines in a cluster. You can assign a proportional number of vnodes to smaller and larger machines.
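For reference, the switch lives in cassandra.yaml; a minimal sketch (256 is the usual default when vnodes are enabled):

# cassandra.yaml
num_tokens: 256     # vnodes: this node claims 256 randomly assigned token ranges
# initial_token:    # left unset; single-token clusters set this manually per node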
How did this help?
We no longer had issues with cluster balancing. The deviation in storage use was minimal with all nodes showing equal utilization.
It’s worth noting that some of the larger Cassandra clusters deployed today are not using virtual nodes.
Or “Why does my code time out in the development environment, but not staging or production?”
Another lesson learned involves the hardware used to run C*. In our case the staging and production environments were on dedicated hardware, but the QA and development clusters were provisioned as VMs.
This isn’t necessarily an issue. VMs can be quite performant and there are some people running large C* clusters on virtual platforms.
In our case these machines did not have their C* storage on local disks. Instead the mount points were provisioned on a NAS.
As we were working to develop features, exceptions kept getting raised in the development and QA clusters that were not manifesting locally. Our developers spent time chasing bugs that only existed in these two environments.
Now let’s move on to our next section, Spark.
To understand where Spark fits in to the project, let’s look back at the data layer and how data enters the system.
We have two sets of data ingestion jobs on the cluster.
Did it work?
Technically, yes
Why change it?
It didn’t meet the SLA. Even with a fairly large number of processes running we couldn’t meet the re-ingestion SLA requirements
How could we make it better?
There are two possible approaches:
1. Optimize the C2S process: add caching and multi-thread where possible. We ended up doing this. It met the SLA, but just barely. We asked ourselves, “What happens when the dataset increases?”
2. Look for a new way to ingest the data.
In the new approach the job is submitted to the Spark cluster.
Joined data is loaded into an RDD
The RDD is mapped into Solr documents
Solr documents are batched and pushed to Solr Cloud
Q: How did this work?
A: Not too well. It was a little faster than the original process, but not by much. There was no major load on the Solr cluster, the bottleneck was definitely within the Spark job.
How did we move forward?
Why is my job running so slow?
Metrics, Metrics, Metrics
By running the job with metrics enabled: we instrumented every method call with timings and collated the results when the job completed. This painted a pretty clear picture of where we were spinning our wheels.
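The talk doesn’t show the instrumentation itself; below is a minimal sketch of the idea in Java, with every name hypothetical: wrap each phase in a timer and collate the totals when the job finishes.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class Timings {
    private static final ConcurrentMap<String, AtomicLong> TOTALS =
        new ConcurrentHashMap<>();

    // Time one call and add its duration to the running total for that label.
    public static <T> T time(String label, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            TOTALS.computeIfAbsent(label, k -> new AtomicLong())
                  .addAndGet(System.nanoTime() - start);
        }
    }

    // Dump the collated totals when the job completes.
    public static void report() {
        TOTALS.forEach((label, nanos) ->
            System.out.printf("%-30s %d ms%n", label, nanos.get() / 1_000_000));
    }
}

One caveat: totals like these accumulate per JVM, so timings gathered on the executors have to be shipped back to the driver (Spark accumulators are one option) before they can be collated.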
In our code we create an RDD, or Resilient Distributed Dataset, with a SQL query to C*. Behind the scenes, our SQL call is turned into multiple C* queries that are then mapped and joined together and returned as a single set of rows.
The majority of our work was being done in a foreach on the joined RDD. Each iteration within the foreach loop would connect, send the document, then continue.
The logic which created a connection to the SolrCloud cluster was a huge drain on time. The creation of the HTTP client took 4 times longer than any other part of the job.
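Reconstructed, the hot loop looked roughly like this (a sketch using SolrJ 4.x’s HttpSolrServer; the notes mention SolrCloud, where CloudSolrServer would be the analogous client):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class NaiveIndexer {
    // Anti-pattern: build (and tear down) an HTTP client for EVERY document.
    static void index(JavaRDD<SolrInputDocument> docs, String solrUrl) {
        docs.foreach(doc -> {
            HttpSolrServer solr = new HttpSolrServer(solrUrl); // the 4x cost lived here
            solr.add(doc);
            solr.shutdown();
        });
    }
}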
Obviously the best solution here would be to use a single connection for sending the documents over.
This led us to our next lesson…
“What’s up with this object not being available in my loop?”
We’ve determined that this should work. We finish writing the job, package it up, and ship it off to the cluster
What do we get?
What?! This was supposed to make everything better!
The problem here is that the function being executed is packed up and shipped off to the executors. In this case the executors don’t have the Solr connection available.
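Sketched with the same hypothetical names, the broken shape is:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class BrokenIndexer {
    static void index(JavaRDD<SolrInputDocument> docs, String solrUrl) {
        // Created once, on the driver...
        HttpSolrServer solr = new HttpSolrServer(solrUrl);

        // ...but captured by the closure below. Spark must serialize the
        // closure to ship it to the executors, and HttpSolrServer is not
        // serializable, so the job fails (java.io.NotSerializableException).
        docs.foreach(doc -> solr.add(doc));
    }
}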
We’ve determined that this should work. We finish writing the job and hit build. We’re met with an error?!
From here we started digging and ended up…
“What do you mean that method doesn’t exist? IT’S IN THE DOCUMENTATION!”
Really knowing our API.
What now?
Our jobs were written in Java against the Java API with version 0.9.2 of Spark.
Scanning the docs revealed the much-needed foreachPartition method on RDD, but what we didn’t see was that it’s only available in Scala. When in the land of Java, like we were, we receive a JavaRDD back from the Cassandra SQL call, not an RDD.
Now our magical method that should fix everything isn’t here.
What’s next?
How about this code? Did it work?
Nope, the job ran and never executed the mapPartitions call.
In Spark an RDD may have a transformation or an action applied to it. mapPartitions is a transformation, so it is lazy: without an action downstream, it never runs.
Actions also allow better performance, as the entire mapped dataset isn’t returned to the driver; instead the results of the action (usually a smaller value) are returned.
Now with an action, the transformation mapPartitions gets applied, but our driver crashes with an out of memory exception.
Lesson!
In our previous bit of code we hit an out of memory exception. This is because all of the rows from the map are being instantiated and returned to the driver; in our case we really don’t need the data.
Here we have instead decided to just return the number of rows processed per partition.
These values may be used to provide metrics and keep a tally on the number of records processed while keeping memory usage low.
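Putting it together, a sketch of the final shape under the same assumptions (Spark 1.x Java API, where mapPartitions takes a function returning an Iterable): one client per partition, created on the executor, and only a count per partition returned to the driver.

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class PartitionIndexer {
    static long index(JavaRDD<SolrInputDocument> docs, String solrUrl) {
        // Transformation: one HTTP client per partition, built on the executor,
        // so nothing non-serializable is captured from the driver.
        JavaRDD<Long> counts = docs.mapPartitions(partition -> {
            HttpSolrServer solr = new HttpSolrServer(solrUrl);
            long count = 0;
            while (partition.hasNext()) {
                solr.add(partition.next()); // batching elided for brevity
                count++;
            }
            solr.shutdown();
            // Return only the per-partition tally, not the documents.
            return Collections.singletonList(count);
        });

        // Action: reduce finally triggers the job; only small longs travel
        // back to the driver, so no out-of-memory.
        return counts.reduce((a, b) -> a + b);
    }
}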
The new Spark-based process was well within the SLA. It provided additional admin features and …
5x decrease in ingestion time