This case study concerns moving large amounts of patent data from Cassandra to Solr: how we approached the problem, why we introduced Spark as a solution, and how we optimized our Spark jobs. I will cover:
* Understanding the parts of a Spark job: which components run where, and common issues.
* Adding metrics to show where pain points are in your code.
* Comparing various methods in the API to achieve more performant code.
* How we saved time and made a repeatable process with Spark.
2. Christopher Bradford
• DataStax Certified Cassandra Architect
• Contributed to CQLEngine - Python C* ORM
• Developed Trireme - a migration engine for Cassandra & DSE
• Created the world’s smallest C* cluster
Twitter: @bradfordcp
GitHub: bradfordcp
3. OpenSource Connections
• Consulting firm based in Charlottesville Virginia
• Founded in 2005
• Focused on Search in 2010, specifically Solr and Lucene
• Delivering Cassandra Consulting since 2012
• DataStax Gold Partner
• Great with Search, Analytics and Discovery
13. CQL vs SQL: WHERE
type | name | rank
------+----------+-------
last | STOBAUGH | 25067
last | BRUDNER | 65304
last | SKLAR | 12517
last | VRANES | 59290
last | SCHRODT | 34764
SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290
CQL
SELECT * FROM names WHERE rank = 59290;
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "
CREATE TABLE names (
type VARCHAR,
name VARCHAR,
rank INT,
PRIMARY KEY ((type, name))
);
14. CQL vs SQL: WHERE
type | name | rank
------+----------+-------
last | STOBAUGH | 25067
last | BRUDNER | 65304
last | SKLAR | 12517
last | VRANES | 59290
last | SCHRODT | 34764
SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290
CQL
SELECT * FROM names WHERE type = 'last' AND name = 'VRANES';
last | VRANES | 59290
CREATE TABLE names (
type VARCHAR,
name VARCHAR,
rank INT,
PRIMARY KEY ((type, name))
);
15. CQL vs SQL: Tables
names_by_rank
rank | type | name
-------+------+----------
25067 | last | STOBAUGH
65304 | last | BRUDNER
12517 | last | SKLAR
59290 | last | VRANES
34764 | last | SCHRODT
names
type | name | rank
------+----------+-------
last | STOBAUGH | 25067
last | BRUDNER | 65304
last | SKLAR | 12517
last | VRANES | 59290
last | SCHRODT | 34764
SQL
SELECT * FROM names_by_rank WHERE rank = 59290;
59290 | last | VRANES
CQL
SELECT * FROM names_by_rank WHERE rank = 59290;
59290 | last | VRANES
16. CQL vs SQL: Indexes
SQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290
CQL
SELECT * FROM names WHERE rank = 59290;
last | VRANES | 59290
type | name | rank
------+----------+-------
last | STOBAUGH | 25067
last | BRUDNER | 65304
last | SKLAR | 12517
last | VRANES | 59290
last | SCHRODT | 34764
CREATE INDEX ON names (rank);
17. CQL vs SQL: Recap
• Consider multiple tables with data models that support fast, efficient querying.
• Remember that writes are extremely fast in C*. Writing to multiple tables is not necessarily a bad thing.
• Build an index table: your model may support building an inverted index for lookups of record ids.
• Use secondary indexes***
19. Unbalanced Cluster Symptoms
• Certain nodes shutting down mid-way through ingestion
• Data is not cleanly distributed across the cluster
20. Unbalanced Cluster Causes
• Data Model – check your partitions!
• Configuration – how are your tokens split amongst the nodes?
• Hardware – is the server configured correctly?
22. Balancing: Data Model
CREATE TABLE images (
year INT,
id TEXT,
page TEXT,
image BLOB,
PRIMARY KEY (year, id, page)
);
SELECT * FROM images WHERE year = 2015;
Sample unbalanced model
23. Balancing: Data Model
CREATE TABLE images (
year INT,
month INT,
id TEXT,
page INT,
image BLOB,
PRIMARY KEY ((year, month), id, page)
);
SELECT * FROM images WHERE year = 2015 AND month IN (1,…);
Switch partition key to use multiple fields instead of just year.
44. APIs: Transformations & Actions
• Transformations: lazily executed; the code is not executed until an action is applied.
  • Ex: map
• Actions: operate on RDD elements and return results to the driver.
  • Ex: foreach
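A minimal sketch of the distinction, in Java since our jobs used the Java API (Java 8 lambdas like these need Spark 1.x; the data and app name are made up):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class LazyDemo {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[2]", "lazy-demo");
        JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

        // Transformation: map is recorded in the RDD lineage, nothing runs yet.
        JavaRDD<Integer> doubled = numbers.map(n -> n * 2);

        // Action: foreach triggers execution; each element is processed on an executor.
        doubled.foreach(n -> System.out.println(n));

        sc.stop();
    }
}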
We are based in Charlottesville, Virginia.
We’ve always been interested in search (one of our founders wrote the book on it). In 2010 we really made search our focus and have been adding related technologies to help deliver on full-text search.
In 2012 we also started delivering Cassandra consulting, and we are currently a Datastax Gold Partner.
We’re really active bloggers, with a bunch of open source code and projects.
Relevant Search will be out soon; it’s a great book about the art of tuning search results.
Building a Search Server with ElasticSearch is a great video introduction to both the Angular JavaScript framework and ElasticSearch.
Apache Solr Enterprise is the definitive guide for planning, building, and maintaining Apache Solr.
Sections
- Cassandra and common pitfalls when starting to develop against it.
- Spark and the trials encountered while implementing an ETL tool.
EST
225 years of patent data starting in 1790
Patents are currently stored as TIF images with XML documents providing metadata and content (currently around 250 fields per patent)
Multiple collections spanning data from many countries (2 currently implemented with an additional 5 coming online this year)
Supports a custom query syntax which has been used at the Patent Office over the past 30 years
DataStax Enterprise 4.5 and 4.6
Cassandra 2.0
Solr 4.10.2
Spark 0.9.2 and 1.1 (more on that later)
The application is composed of a rich front-end application, an API layer, and a data layer.
For the purposes of this talk we’re going to focus on the data layer.
Spark isn't shown at the moment, but we'll get there.
EST is a system for searching patents and applications from 1790 to today. Here we see the number of applications and grants from 1963 through 2014.
Each patent is composed of a few components. The canonical patent is a set of TIF images. Metadata surrounding the patent is stored in XML files and databases.
This project must ingest the data from these data sources (compressed archives and legacy systems) and make it searchable.
Or “How your queries fail because WHERE does not work the way you think it should”
When getting started with Cassandra, the first issues you encounter involve CQL.
Coming from a relational world, it looks so much like SQL that we are lured into a false sense of comfort.
I remember seeing the SELECT * queries in tutorials and thinking “I’ve got this”. I was wrong.
Instead of getting data back, I was met with error, after error, after error.
Here we have a table which holds some census data about names. The table has three columns: type and name, which together form the partition key, and rank, which is not indexed, merely present.
In SQL you can query on most columns without hassle; depending on the volume of data, queries may be slow, but eventually data will come back.
In CQL columns may only be queried if they are part of the PRIMARY KEY or have a secondary index applied to them. In this case the query is not executed.
To get our query to execute we must filter on ALL the partition key columns (and subsequently clustering columns, but this is optional). In our case rank isn’t indexed at all so we cannot include it in the query.
This leaves us with a query that doesn’t match the sentiment that we had in our SQL query.
In fact they don’t function the same at all. How can we go about finding a name by its rank?
We could construct a new table that has rank as the primary key column.
This allows for the data to be queried in CQL just like our SQL from before.
There are a few drawbacks here, now we are maintaining data in 2 tables, names and names by rank. This isn’t too bad since the size of the data is small. We need to be sure to update both tables when changes are made to the data.
Another concern is with the new table. Care must be taken to ensure that the PRIMARY KEY on the new table behaves the way we expect. In this case our new PRIMARY KEY must be ((rank), type). Without the type value as a clustering column we would have upserts when inserting a first and last name with the same rank.
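In CQL, that second table might look like this (a sketch; names_by_rank is the table from the slide, and the key layout follows the note above):

CREATE TABLE names_by_rank (
  rank INT,
  type VARCHAR,
  name VARCHAR,
  PRIMARY KEY ((rank), type)
);

-- The lookup from the SQL example now works:
SELECT * FROM names_by_rank WHERE rank = 59290;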
Another option is to create a secondary index on the column. Secondary indexes are local indexes, meaning they exist on each node.
If we reference a partition key in the query along with a secondary index the query will only go to the node responsible for that partition before performing the lookup on the local index.
Cassandra handles keeping the index up to date as data changes. If you’re looking for a distributed index consider setting up a index table (like the one in the last slide).
*** Secondary indexes are NOT always the answer. For a while they were not recommended at all. Ensure the field being indexed is low-cardinality, and use only one secondary index per query: a query with two will consult only one index and then filter all matching rows on the second indexed column.
EST used point lookups against C*. Complex queries (including ranges) were run against the Solr cluster.
WHERE can be tricky when starting out with C*. I recommend checking out DataStax’s blog where Benjamin Lerer has an excellent article called “A deep look to the CQL where clause”
http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause
It really goes in to all the ways you can use WHERE given various data models.
Or “Why did the cluster go down after hours of ingesting data?”
Our next lesson concerns keeping the C* cluster balanced. Cassandra does a lot of work trying to keep the data evenly distributed across all of the nodes in our cluster.
Even with these optimizations and approaches, things can still get a bit out of whack.
Let’s look at the symptoms of an unbalanced cluster.
OpsCenter showed certain nodes were completely full while others had ~10 MB of data on them. What’s happening?!
Data Model: certain partitions in the data naturally have more data
Configuration: The cluster may be configured in such a way that data isn’t evenly distributed
Hardware: Nodes may be experiencing hardware issues, or misconfiguration on the OS level
Let’s apply some context to these numbers:
- Images
- Pages
- Metadata for each
Fix issues within the data model that can lead to unbalanced nodes
Note how the partition key for this record is year. In this case every image of every page of a patent is stored in a single wide row with the year as the row key.
Looking at our graph the width of this row will simply keep getting bigger and bigger.
In this model we have added the month field and placed it as part of the partition key. Now patents are distributed into smaller, more manageable buckets.
It’s worth noting that with this approach all SELECTs will require specifying both the year AND month. This can be done with the IN clause or, as described recently in a DataStax blog post, with separate asynchronous queries.
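A sketch of the separate-asynchronous-queries variant with the DataStax Java driver (2.x era API; the contact point and keyspace name are placeholders):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSetFuture;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class MonthBuckets {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("patents"); // hypothetical keyspace

        // Fire one async query per (year, month) partition instead of one IN query.
        List<ResultSetFuture> futures = new ArrayList<>();
        for (int month = 1; month <= 12; month++) {
            futures.add(session.executeAsync(
                "SELECT * FROM images WHERE year = ? AND month = ?", 2015, month));
        }

        // Drain the futures as they complete.
        for (ResultSetFuture future : futures) {
            for (Row row : future.getUninterruptibly()) {
                System.out.println(row.getString("id"));
            }
        }
        cluster.close();
    }
}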
For more information on how data is stored in Cassandra check out the excellent deep dive on the CQL storage engine by John Berryman on Planet Cassandra.
Virtual Nodes?
By default Cassandra is configured with virtual nodes enabled. This means that each node in the cluster is randomly assigned multiple chunks of the token ring.
In some cases virtual nodes are disabled, such as when running Hadoop or Solr on top of Cassandra.
Look into the cluster configuration. In our case we were using single token nodes. We had gross balancing issues with certain nodes completely filling their disks and others sitting mostly empty.
Why virtual nodes?
You no longer have to calculate and assign tokens to each node.
Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster.
Rebuilding a dead node is faster because it involves every other node in the cluster.
Improves the use of heterogeneous machines in a cluster. You can assign a proportional number of vnodes to smaller and larger machines.
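For reference, the switch lives in cassandra.yaml; a minimal sketch (256 is the usual default when vnodes are enabled):

# cassandra.yaml
num_tokens: 256     # vnodes: this node claims 256 randomly assigned token ranges
# initial_token:    # left unset; single-token clusters set this manually per node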
How did this help?
We no longer had issues with cluster balancing. The deviation in storage use was minimal with all nodes showing equal utilization.
It’s worth noting that some of the larger Cassandra clusters deployed today are not using virtual nodes.
Or “Why does my code time out in the development environment, but not staging or production?”
Another lesson learned involves the hardware used to run C*. In our case the staging and production environments were on dedicated hardware, but the QA and development clusters were provisioned as VMs.
This isn’t necessarily an issue. VMs can be quite performant and there are some people running large C* clusters on virtual platforms.
In our case these machines did not have their C* storage on local disks. Instead the mount points were provisioned on a NAS.
As we were working to develop features, exceptions kept getting raised in the development and QA clusters that were not manifesting locally. Our developers spent time chasing bugs that only existed in these two environments.
Now let’s move on to our next section, Spark.
To understand where Spark fits in to the project, let’s look back at the data layer and how data enters the system.
We have two sets of data ingestion jobs on the cluster.
Did it work?
Technically, yes
Why change it?
It didn’t meet the SLA. Even with a fairly large number of processes running we couldn’t meet the re-ingestion SLA requirements
How could we make it better?
There are two possible approaches:
1. Optimize the C2S process: add caching and multi-thread where possible. We ended up doing this. It met the SLA, but just barely. We asked ourselves, “What happens when the dataset increases?”
2. Look for a new way to ingest the data.
In the new approach the job is submitted to the Spark cluster.
Joined data is loaded into an RDD
The RDD is mapped into Solr documents
Solr documents are batched and pushed to Solr Cloud
Q: How did this work?
A: Not too well. It was a little faster than the original process, but not by much. There was no major load on the Solr cluster, the bottleneck was definitely within the Spark job.
How did we move forward?
Why is my job running so slow?
Metrics, Metrics, Metrics
By running the job with metrics enabled: we instrumented every method call with timings and collated the results when the job completed. This painted a pretty clear picture of where we were spinning our wheels.
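The talk doesn’t show the instrumentation itself; below is a minimal sketch of the idea in Java, with every name hypothetical: wrap each phase in a timer and collate the totals when the job finishes.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Supplier;

public class Timings {
    private static final ConcurrentMap<String, AtomicLong> TOTALS =
        new ConcurrentHashMap<>();

    // Time one call and add its duration to the running total for that label.
    public static <T> T time(String label, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            TOTALS.computeIfAbsent(label, k -> new AtomicLong())
                  .addAndGet(System.nanoTime() - start);
        }
    }

    // Dump the collated totals when the job completes.
    public static void report() {
        TOTALS.forEach((label, nanos) ->
            System.out.printf("%-30s %d ms%n", label, nanos.get() / 1_000_000));
    }
}

One caveat: totals like these accumulate per JVM, so timings gathered on the executors have to be shipped back to the driver (Spark accumulators are one option) before they can be collated.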
In our code we create an RDD, or Resilient Distributed Dataset, with a SQL query to C*. Behind the scenes, our SQL call is turned into multiple C* queries that are then mapped and joined together and returned as a single set of rows.
The majority of our work was being done in a foreach on the joined RDD. Each iteration within the foreach loop would connect, send the document, then continue.
The logic which created a connection to the SolrCloud cluster was a huge drain on time. The creation of the HTTP client took 4 times longer than any other part of the job.
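Reconstructed, the hot loop looked roughly like this (a sketch using SolrJ 4.x’s HttpSolrServer; the notes mention SolrCloud, where CloudSolrServer would be the analogous client):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class NaiveIndexer {
    // Anti-pattern: build (and tear down) an HTTP client for EVERY document.
    static void index(JavaRDD<SolrInputDocument> docs, String solrUrl) {
        docs.foreach(doc -> {
            HttpSolrServer solr = new HttpSolrServer(solrUrl); // the 4x cost lived here
            solr.add(doc);
            solr.shutdown();
        });
    }
}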
Obviously the best solution here would be to use a single connection for sending the documents over.
This led us to our next lesson…
“What’s up with this object not being available in my loop?”
We’ve determined that this should work. We finish writing the job, package it up, and ship it off to the cluster
What do we get?
What?! This was supposed to make everything better!
The problem here is that the function being executed is packed up and shipped off to the executors. In this case the executors don’t have the Solr connection available.
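Sketched with the same hypothetical names, the broken shape is:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class BrokenIndexer {
    static void index(JavaRDD<SolrInputDocument> docs, String solrUrl) {
        // Created once, on the driver...
        HttpSolrServer solr = new HttpSolrServer(solrUrl);

        // ...but captured by the closure below. Spark must serialize the
        // closure to ship it to the executors, and HttpSolrServer is not
        // serializable, so the job fails (java.io.NotSerializableException).
        docs.foreach(doc -> solr.add(doc));
    }
}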
We’ve determined that this should work. We finish writing the job and hit build. We’re met with an error?!
From here we started digging and ended up…
“What do you mean that method doesn’t exist? IT’S IN THE DOCUMENTATION!”
Really knowing our API.
What now?
Our jobs were written in Java against the Java API with version 0.9.2 of Spark.
Scanning the docs revealed the much-needed foreachPartition method on RDD, but what we didn’t see was that it’s only available in Scala. When in the land of Java, like we were, we receive a JavaRDD back from the Cassandra SQL call, not an RDD.
Now our magical method that should fix everything isn’t here.
What’s next?
How about this code? Did it work?
Nope, the job ran and never executed the mapPartitions call.
In Spark an RDD may have a transformation or an action applied to it. mapPartitions is a transformation, so it is lazy: without an action downstream, it never runs.
Actions also allow better performance, as the entire mapped dataset isn’t returned to the driver; instead the results of the action (usually a smaller value) are returned.
Now with an action, the transformation mapPartitions gets applied, but our driver crashes with an out of memory exception.
Lesson!
In our previous bit of code we hit an out of memory exception. This is because all of the rows from the map are being instantiated and returned to the driver; in our case we really don’t need the data.
Here we have instead decided to just return the number of rows processed per partition.
These values may be used to provide metrics and keep a tally on the number of records processed while keeping memory usage low.
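Putting it together, a sketch of the final shape under the same assumptions (Spark 1.x Java API, where mapPartitions takes a function returning an Iterable): one client per partition, created on the executor, and only a count per partition returned to the driver.

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.spark.api.java.JavaRDD;

public class PartitionIndexer {
    static long index(JavaRDD<SolrInputDocument> docs, String solrUrl) {
        // Transformation: one HTTP client per partition, built on the executor,
        // so nothing non-serializable is captured from the driver.
        JavaRDD<Long> counts = docs.mapPartitions(partition -> {
            HttpSolrServer solr = new HttpSolrServer(solrUrl);
            long count = 0;
            while (partition.hasNext()) {
                solr.add(partition.next()); // batching elided for brevity
                count++;
            }
            solr.shutdown();
            // Return only the per-partition tally, not the documents.
            return Collections.singletonList(count);
        });

        // Action: reduce finally triggers the job; only small longs travel
        // back to the driver, so no out-of-memory.
        return counts.reduce((a, b) -> a + b);
    }
}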
The new Spark-based process was well within the SLA. It provided additional admin features and …
5x decrease in ingestion time