Lessons Learned with Cassandra & Spark at the USPTO
Christopher Bradford
• DataStax Certified Cassandra Architect
• Contributed to CQLEngine - Python C* ORM
• Developed Trireme - a migration engine for Cassandra & DSE
• Created the world’s smallest C* cluster
Twitter: @bradfordcp
GitHub: bradfordcp
OpenSource Connections
• Consulting firm based in Charlottesville, Virginia
• Founded in 2005
• Focused on Search in 2010, specifically Solr and Lucene
• Delivering Cassandra Consulting since 2012
• DataStax Gold Partner
• Great with Search, Analytics and Discovery
OpenSource Connections
• Blog: http://o19s.com/blog/
• Twitter: @o19s
• GitHub: o19s
Exploring Search Technologies
Technologies
Architecture
Architecture – Data Layer
Data
[Chart: Patent Applications & Grants per year, 1963 to 2014. Series: Applications, Grants. Y-axis: 0 to 700,000.]
WHERE Clauses
WHERE
YOU KEEP USING THAT CLAUSE
I DON’T THINK YOU KNOW WHAT THAT MEANS
CQL vs SQL: WHERE
CREATE TABLE names (
    type VARCHAR,
    name VARCHAR,
    rank INT,
    PRIMARY KEY ((type, name))
);
 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764
SQL
SELECT * FROM names WHERE rank = 59290;
 last | VRANES | 59290
CQL
SELECT * FROM names WHERE rank = 59290;
InvalidRequest: code=2200 [Invalid query] message="No secondary indexes on the restricted columns support the provided operators: "
CQL vs SQL: WHERE
CREATE TABLE names (
    type VARCHAR,
    name VARCHAR,
    rank INT,
    PRIMARY KEY ((type, name))
);
 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764
SQL
SELECT * FROM names WHERE rank = 59290;
 last | VRANES | 59290
CQL
SELECT * FROM names WHERE type = 'last' AND name = 'VRANES';
 last | VRANES | 59290
CQL vs SQL: Tables
names
 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764
names_by_rank
 rank  | type | name
-------+------+----------
 25067 | last | STOBAUGH
 65304 | last | BRUDNER
 12517 | last | SKLAR
 59290 | last | VRANES
 34764 | last | SCHRODT
SQL
SELECT * FROM names_by_rank WHERE rank = 59290;
 59290 | last | VRANES
CQL
SELECT * FROM names_by_rank WHERE rank = 59290;
 59290 | last | VRANES
CQL vs SQL: Indexes
 type | name     | rank
------+----------+-------
 last | STOBAUGH | 25067
 last | BRUDNER  | 65304
 last | SKLAR    | 12517
 last | VRANES   | 59290
 last | SCHRODT  | 34764
CREATE INDEX ON names (rank);
SQL
SELECT * FROM names WHERE rank = 59290;
 last | VRANES | 59290
CQL
SELECT * FROM names WHERE rank = 59290;
 last | VRANES | 59290
CQL vs SQL: Recap
• Consider multiple tables with data models that support fast, efficient querying.
  • Remember that writes are extremely fast in C*. Writing to multiple tables is not necessarily a bad thing.
• Build an index table.
  • Your model may support building an inverted index for lookups of record ids.
• Use secondary indexes***
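The "multiple tables" idea from the recap can be sketched in plain Python, with no Cassandra involved. This is only an illustration under stated assumptions: each dictionary stands in for a table keyed by its partition key, and the `insert` helper is a hypothetical name for the application code that writes every row to both tables.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, str, int]  # (type, name, rank)

ROWS: List[Row] = [
    ("last", "STOBAUGH", 25067),
    ("last", "BRUDNER", 65304),
    ("last", "SKLAR", 12517),
    ("last", "VRANES", 59290),
    ("last", "SCHRODT", 34764),
]

# Stand-in for "names": rows keyed by the partition key (type, name).
names: Dict[Tuple[str, str], Row] = {}
# Stand-in for "names_by_rank": the same rows keyed by rank.
names_by_rank: Dict[int, Row] = {}


def insert(row: Row) -> None:
    """Write the row once per table, so each query has a table that serves it."""
    type_, name, rank = row
    names[(type_, name)] = row
    names_by_rank[rank] = row


for row in ROWS:
    insert(row)

# Each lookup goes straight to the key its table is organized around.
by_key = names[("last", "VRANES")]
by_rank = names_by_rank[59290]

# With only the first table, finding a rank means scanning every row.
scanned = [r for r in names.values() if r[2] == 59290]

print(by_key, by_rank, scanned)
```

The extra write in `insert` is the price of the second lookup path, which is the trade the recap describes: cheap writes in exchange for reads that never need a full scan.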
Cluster Balancing
Unbalanced Cluster Symptoms
• Certain nodes shutting down mid-way through ingestion
• Data is not cleanly distributed across the cluster
Unbalanced Cluster Causes
• Data Model – check your partitions!
• Configuration – how are your tokens split amongst the nodes?
• Hardware – is the server configured correctly?
Data
[Chart: Patent Applications & Grants per year, 1963 to 2014. Series: Applications, Grants. Y-axis: 0 to 700,000.]
Balancing: Data Model
CREATE TABLE images (
year INT,
id TEXT,
page TEXT,
image BLOB,
PRIMARY KEY (year, id, page)
);
SELECT * FROM images WHERE year = 2015;
Sample unbalanced model
Balancing: Data Model
CREATE TABLE images (
year INT,
month INT,
id TEXT,
page INT,
image BLOB,
PRIMARY KEY ((year, month), id, page)
);
SELECT * FROM images WHERE year = 2015 AND month IN (1,…);
Switch partition key to use multiple fields instead of just year.
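A rough way to see the effect of the two partition keys is to hash them and count rows per partition. The sketch below is plain Python, not Cassandra: `partition_token` is a hypothetical helper, md5 is only a stable stand-in for the real partitioner, and the workload of 1,000 images per month is invented for illustration.

```python
import hashlib
from collections import Counter


def partition_token(*key_parts) -> int:
    """Hash the partition key columns to a token (md5 is a stand-in, not the real partitioner)."""
    raw = "|".join(str(part) for part in key_parts).encode()
    return int.from_bytes(hashlib.md5(raw).digest()[:8], "big")


# Assumed workload: 1,000 page images per month for 2015.
rows = [(2015, month, f"doc-{i}", 1) for month in range(1, 13) for i in range(1000)]

# PRIMARY KEY (year, id, page): the partition key is year alone.
by_year = Counter(partition_token(year) for year, month, doc_id, page in rows)

# PRIMARY KEY ((year, month), id, page): the partition key is (year, month).
by_year_month = Counter(partition_token(year, month) for year, month, doc_id, page in rows)

print("year only:     partitions =", len(by_year), "largest =", max(by_year.values()))
print("(year, month): partitions =", len(by_year_month), "largest =", max(by_year_month.values()))
```

With year as the only partition key, every 2015 row lands in a single partition; adding month splits the same rows into twelve smaller partitions that can be placed on different nodes.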
Balancing: Configuration
Virtual Nodes?
Source: http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureDataDistributeDistribute_c.html
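To make the token question concrete, here is a toy token ring in plain Python. It is a sketch under several assumptions: the ring is 32 bits wide, tokens are derived by hashing labels, and the one-token-per-node case uses arbitrary tokens to mimic a poorly chosen manual assignment. None of the names below come from Cassandra itself.

```python
import hashlib

RING_SIZE = 2 ** 32  # toy token space; real partitioners use a much larger range


def token(label: str) -> int:
    """Derive a stable pseudo-random token from a label."""
    return int.from_bytes(hashlib.md5(label.encode()).digest()[:4], "big")


def build_ring(nodes, tokens_per_node):
    """Give every node tokens_per_node tokens and return the ring sorted by token."""
    return sorted(
        (token(f"{node}/{i}"), node)
        for node in nodes
        for i in range(tokens_per_node)
    )


def ownership(ring):
    """Fraction of the ring per node; a token owns the range back to the previous token."""
    owned = {}
    for idx, (tok, node) in enumerate(ring):
        prev_tok = ring[idx - 1][0]  # index 0 wraps around to the last token
        owned[node] = owned.get(node, 0) + (tok - prev_tok) % RING_SIZE
    return {node: span / RING_SIZE for node, span in owned.items()}


NODES = ["node-a", "node-b", "node-c"]
single = ownership(build_ring(NODES, tokens_per_node=1))
vnodes = ownership(build_ring(NODES, tokens_per_node=256))

for label, shares in (("1 token per node", single), ("256 vnodes per node", vnodes)):
    print(label, {node: round(share, 3) for node, share in sorted(shares.items())})
```

Printing the two ownership maps side by side shows how far each node's share sits from an even third. Many small ranges per node typically average out closer to an even split than a few arbitrary tokens do.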
Hardware
Hardware
• Understand the type of hardware Cassandra runs well on.
LOCAL STORAGE
NETWORK STORAGE
Data Ingestion
Data Loading Performance
• Did it work?
• Why change it?
• How could we make it better?
Spark Data Loading
Collecting Metrics
Metrics
• GitHub: dropwizard/metrics
• Awesome Java library for collecting metrics in your code
• Demo later
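The kind of measurement a metrics library provides can be sketched with a small timer. This is plain Python and deliberately not the Dropwizard Metrics API: `Timer`, `measure`, and `push_document` are hypothetical names used only to show the idea of wrapping a unit of work and recording its count and duration.

```python
import time
from contextlib import contextmanager


class Timer:
    """Tiny stand-in for a metrics timer: counts events and accumulates elapsed time."""

    def __init__(self) -> None:
        self.count = 0
        self.total_seconds = 0.0

    @contextmanager
    def measure(self):
        """Time the body of a with-block, recording it even if the body raises."""
        start = time.perf_counter()
        try:
            yield
        finally:
            self.count += 1
            self.total_seconds += time.perf_counter() - start

    @property
    def mean_seconds(self) -> float:
        return self.total_seconds / self.count if self.count else 0.0


def push_document(document: dict) -> None:
    """Hypothetical stand-in for the work being measured."""
    time.sleep(0.001)


push_timer = Timer()
for i in range(5):
    with push_timer.measure():
        push_document({"id": i})

print(f"pushes={push_timer.count} mean={push_timer.mean_seconds * 1000:.2f} ms")
```

Numbers like these answer the "Did it work?" and "How could we make it better?" questions with measurements instead of impressions.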
Poor Performance
joinedRDD = …
joinedRDD.foreach(row -> {
    document = … // build document
    sc = new SolrConnection() // a new connection is opened for every record
    sc.push(document)
    sc.disconnect()
})
// Job is done
Optimum Performance
joinedRDD = …
sc = new SolrConnection() // a single connection, opened once
joinedRDD.foreach(row -> {
    document = … // build document
    sc.push(document)
})
sc.disconnect()
// Job is done
Scope
Scope: Review
joinedRDD = …
sc = new SolrConnection() // created on the driver...
joinedRDD.foreach(row -> {
    document = … // build document
    sc.push(document) // ...but used inside the closure that runs on the executors
})
sc.disconnect()
// Job is done
Scope: ERROR
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
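The error appears because Spark has to serialize everything a task's closure captures before sending it to the executors, and a live connection cannot be serialized. The same failure can be reproduced in plain Python with `pickle`. This is an analogy, not Spark code: `FakeSolrConnection` is a hypothetical class, and its lock stands in for an open socket.

```python
import pickle
import threading


class FakeSolrConnection:
    """Hypothetical connection; the lock stands in for a live socket that cannot be serialized."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self.pushed = []

    def push(self, document: dict) -> None:
        with self._lock:
            self.pushed.append(document)


def try_to_ship(obj) -> str:
    """Mimic shipping a task's captured state to a worker by serializing it."""
    try:
        pickle.dumps(obj)
        return "serializable"
    except (TypeError, pickle.PicklingError) as exc:
        return f"not serializable: {exc}"


connection_result = try_to_ship(FakeSolrConnection())
data_result = try_to_ship({"id": 1, "title": "doc"})

print(connection_result)  # the driver-side connection cannot be shipped
print(data_result)        # plain data can
```

Plain data crosses the boundary without trouble; the object holding an operating-system resource does not, so it has to be created on the side that uses it.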
Scope: Fixed!
joinedRDD = …
joinedRDD.foreachPartition(partition -> {
    sc = new SolrConnection() // one connection per partition, created on the executor
    partition.foreach(row -> {
        document = …
        sc.push(document)
    })
})
// Job is done
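The difference between the per-record and per-partition patterns comes down to how many connections get opened. A minimal simulation in plain Python, assuming two partitions of 500 rows each; `CountingConnection` and both indexing functions are hypothetical stand-ins rather than Spark or Solr APIs.

```python
from typing import Iterable, List


class CountingConnection:
    """Hypothetical connection that only counts how often it is opened."""

    opened = 0

    def __init__(self) -> None:
        CountingConnection.opened += 1
        self.pushed: List[dict] = []

    def push(self, document: dict) -> None:
        self.pushed.append(document)

    def disconnect(self) -> None:
        pass


def index_per_record(partition: Iterable[int]) -> None:
    """Open and close a connection for every row (the slow pattern)."""
    for row in partition:
        conn = CountingConnection()
        conn.push({"id": row})
        conn.disconnect()


def index_per_partition(partition: Iterable[int]) -> None:
    """Open one connection, reuse it for every row in the partition, then close it."""
    conn = CountingConnection()
    try:
        for row in partition:
            conn.push({"id": row})
    finally:
        conn.disconnect()


partitions = [list(range(0, 500)), list(range(500, 1000))]

CountingConnection.opened = 0
for partition in partitions:
    index_per_record(partition)
per_record_connections = CountingConnection.opened

CountingConnection.opened = 0
for partition in partitions:
    index_per_partition(partition)
per_partition_connections = CountingConnection.opened

print(per_record_connections, per_partition_connections)
```

The per-partition version opens one connection per partition instead of one per row, and because each connection is created where it is used, nothing has to be serialized and shipped.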
Know your API
Java RDD != Scala RDD
APIs: mapPartitions()
joinedRDD = …
joinedRDD.mapPartitions(partition -> {
    sc = new SolrConnection()
    partition.foreach(row -> {
        document = … // build document
        sc.push(document)
    })
    return partition.rows
})
APIs: Transformations & Actions
• Transformations: lazily executed; the code does not run until an action is applied.
  • Ex: map
• Actions: trigger execution; they operate on the RDD elements and may return a result to the driver.
  • Ex: foreach
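Lazy transformations behave much like Python generators, which makes the distinction easy to demonstrate without Spark. In this sketch `lazy_map` and `collect` are hypothetical stand-ins for a transformation and an action, and `log` records when the mapped function actually runs.

```python
from typing import Callable, Iterable, Iterator, List

log: List[str] = []


def lazy_map(func: Callable[[int], int], rows: Iterable[int]) -> Iterator[int]:
    """Transformation stand-in: describes the work but runs nothing until consumed."""
    for row in rows:
        log.append(f"map({row})")
        yield func(row)


def collect(rows: Iterable[int]) -> List[int]:
    """Action stand-in: consumes the pipeline, which forces it to execute."""
    return list(rows)


pipeline = lazy_map(lambda x: x * 2, [1, 2, 3])
log_before_action = list(log)  # still empty: nothing has run yet

result = collect(pipeline)
log_after_action = list(log)   # the map function ran once per row

print(log_before_action, result, log_after_action)
```

This is why a `mapPartitions()` call on its own pushes nothing: until an action consumes the result, the function is never invoked.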
APIs: mapPartitions()
joinedRDD = …
joinedRDD.mapPartitions(partition -> {
    sc = new SolrConnection()
    partition.foreach(row -> {
        document = … // build document
        sc.push(document)
    })
    return partition.rows // every row is sent back to the driver by collect()
})
.collect()
Understand how data is passed around
Memory: Solution
joinedRDD = …
joinedRDD.mapPartitions(partition -> {
    sc = new SolrConnection()
    partition.foreach(row -> {
        document = … // build document
        sc.push(document)
    })
    return partition.rows.length // only a count per partition reaches the driver
})
.collect()
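The memory fix is about what travels back to the driver when the action runs. The plain-Python sketch below assumes two small partitions of fake image rows; `map_partitions` is a hypothetical helper that mimics `mapPartitions()` followed by `collect()` by flattening each partition's output into one list.

```python
from typing import Callable, Iterable, Iterator, List


def map_partitions(partitions: List[list], func: Callable[[Iterator], Iterable]) -> list:
    """Apply func to each partition and gather everything it returns, as collect() would."""
    collected = []
    for partition in partitions:
        collected.extend(func(iter(partition)))
    return collected


def push_and_return_rows(rows: Iterator[dict]) -> List[dict]:
    """Pretend to push each row, then hand every row back to the caller."""
    return [row for row in rows]


def push_and_return_count(rows: Iterator[dict]) -> List[int]:
    """Pretend to push each row, then hand back only how many were processed."""
    pushed = 0
    for _ in rows:
        pushed += 1
    return [pushed]


# Assumed data: five rows carrying a 1 KiB fake image each, split over two partitions.
partitions = [
    [{"id": i, "image": b"x" * 1024} for i in range(3)],
    [{"id": i, "image": b"x" * 1024} for i in range(3, 5)],
]

all_rows = map_partitions(partitions, push_and_return_rows)
counts = map_partitions(partitions, push_and_return_count)

print(len(all_rows), counts)
```

Returning the rows makes the driver hold a copy of the whole data set, images included; returning a count keeps the result to one small integer per partition while still triggering the work.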
How did it go?
Demo
Questions?
Thank You
© 2015. All Rights Reserved.

 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and DriversDataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 
Apache Cassandra and Drivers
Apache Cassandra and DriversApache Cassandra and Drivers
Apache Cassandra and Drivers
 

Recently uploaded

Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringWSO2
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...FIDO Alliance
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsLeah Henrickson
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaWSO2
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingWSO2
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...caitlingebhard1
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe中 央社
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityVictorSzoltysek
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuidePixlogix Infotech
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptxFIDO Alliance
 

Recently uploaded (20)

Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Choreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software EngineeringChoreo: Empowering the Future of Enterprise Software Engineering
Choreo: Empowering the Future of Enterprise Software Engineering
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...Hyatt driving innovation and exceptional customer experiences with FIDO passw...
Hyatt driving innovation and exceptional customer experiences with FIDO passw...
 
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on ThanabotsContinuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
Continuing Bonds Through AI: A Hermeneutic Reflection on Thanabots
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
JavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate GuideJavaScript Usage Statistics 2024 - The Ultimate Guide
JavaScript Usage Statistics 2024 - The Ultimate Guide
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
 

Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office

Editor's Notes

  1. We are based in Charlottesville, Virginia. We’ve always been interested in search (one of our founders wrote the book on it). In 2010 we made search our focus and have since been adding related technologies to help deliver on full-text search. In 2012 we also started delivering Cassandra consulting, and we are currently a DataStax Gold Partner.
  2. We are really active bloggers, with a bunch of open source code and projects. Relevant Search will be out soon; it is a great book about the art of tuning search results. Building a Search Server with ElasticSearch is a great video introduction to both the Angular JavaScript framework and ElasticSearch. Apache Solr Enterprise is the definitive guide for planning, building and maintaining Apache Solr.
  3. Sections: Cassandra and common pitfalls when starting to develop against it; Spark and the trials encountered while implementing an ETL tool. EST: 225 years of patent data starting in 1790. Patents are currently stored as TIF images with XML documents providing metadata and content (currently around 250 fields per patent). Multiple collections span data from many countries (2 currently implemented, with an additional 5 coming online this year). EST supports a custom query syntax which has been used at the Patent Office over the past 30 years.
  4. DataStax Enterprise 4.5 and 4.6; Cassandra 2.0; Solr 4.10.2; Spark 0.9.2 and 1.1 (more on that later).
  5. The application is composed of a rich front-end application, an API layer, and a data layer.
  6. For the purposes of this talk we’re going to focus here on the data layer. Spark isn't shown at the moment, but we'll get there.
  7. EST is a system for searching patents and applications from 1790 to today. Here we see the number of applications and grants from 1963 through 2014. Each patent is composed of a few components. The canonical patent is a set of TIF images. Metadata surrounding the patent is stored in XML files and databases. This project must ingest the data from these data sources (compressed archives and legacy systems) and make it searchable.
  8. Or “How your queries fail because WHERE does not work the way you think it should.” When getting started with Cassandra, the first issues encountered involve CQL. Coming from a relational world, it looks so much like SQL that we are lured into a false sense of comfort.
  9. I remember seeing the SELECT * queries in tutorials and thinking “I’ve got this”. I was wrong. Instead of getting data back I was met with error after error after error.
  10. Here we have a table which holds some census data about names. The table has 3 columns: type and name, which together form the partition key, and rank, which is not indexed, merely present. In SQL you can query on most columns without hassle; depending on the volume of data, queries may be slow, but eventually data will come back. In CQL, columns may only be queried if they are part of the PRIMARY KEY or have a secondary index applied to them. In this case the query is not executed.
  11. To get our query to execute we must filter on ALL the partition key columns (and, optionally, on the clustering columns after them). In our case rank isn’t indexed at all, so we cannot include it in the query. This leaves us with a query that doesn’t match the intent of our SQL query. In fact, the two don’t function the same at all. How can we go about finding a name by its rank?
  12. We could construct a new table that has rank as its partition key column. This allows the data to be queried in CQL just like our SQL from before. There are a few drawbacks here. We are now maintaining data in 2 tables, names and names by rank. This isn’t too bad since the size of the data is small, but we need to be sure to update both tables when changes are made to the data. Another concern is with the new table: care must be taken to ensure that its PRIMARY KEY behaves the way we expect. In this case our new PRIMARY KEY must be ((rank), type). Without the type value as a clustering column we would have upserts when inserting a first and last name with the same rank.
  13. Another option is to create a secondary index on the column. Secondary indexes are local indexes, meaning they exist on each node. If we reference a partition key in the query along with a secondary index, the query will only go to the node responsible for that partition before performing the lookup on the local index. Cassandra handles keeping the index up to date as data changes. If you’re looking for a distributed index, consider setting up an index table (like the one in the previous slide).
  14. *** Secondary indexes are NOT always the answer. For a while they were not recommended at all. Ensure the field being indexed is low cardinality and that only 1 secondary index is used in a query. Using 2 will consult only 1 index; filtering is then performed on all rows for the second indexed column. EST used point lookups against C*. Complex queries (including ranges) were run against the Solr cluster. WHERE can be tricky when starting out with C*. I recommend checking out DataStax’s blog, where Benjamin Lerer has an excellent article called “A deep look to the CQL where clause” (http://www.datastax.com/dev/blog/a-deep-look-to-the-cql-where-clause). It really goes into all the ways you can use WHERE given various data models.
  15. Or “Why did the cluster go down after hours of ingesting data?” Our next lesson concerns keeping the C* cluster balanced. Cassandra does a lot of work trying to keep the data evenly distributed across all of the nodes in our cluster. Even with these optimizations and approaches, things can still get a bit out of whack. Let’s look at the symptoms of an unbalanced cluster.
  16. OpsCenter showed certain nodes were completely full while others had ~10 MB of data on them. What’s happening?!
  17. Data model: certain partitions in the data naturally have more data. Configuration: the cluster may be configured in such a way that data isn’t evenly distributed. Hardware: nodes may be experiencing hardware issues, or misconfiguration at the OS level.
  18. Let’s apply some context to these numbers: images, pages, and metadata for each.
  19. Fix issues within the data model that can lead to unbalanced nodes. Note how the partition key for this record is year. In this case every image of every page of a patent is stored in a single wide row with the year as the row key. Looking at our graph, the width of this row will simply keep getting bigger and bigger.
  20. In this model we have added the month field and placed it as part of the partition key. Now patents are distributed into smaller, more manageable buckets. It’s worth noting that with this approach all SELECTs will require specifying both the year AND month. This can be done with the IN clause or, as described recently in a DataStax blog post, with separate asynchronous queries. For more information on how data is stored in Cassandra, check out the excellent deep dive on the CQL storage engine by John Berryman on Planet Cassandra.
  21. Virtual nodes? By default Cassandra is configured with virtual nodes enabled. This means that each node in the cluster owns multiple chunks of the token ring, which are randomly assigned to it. In some cases virtual nodes are disabled, such as when running Hadoop or Solr on top of Cassandra.
  22. Look into the cluster configuration. In our case we were using single token nodes. We had gross balancing issues with certain nodes completely filling their disks and others sitting mostly empty. Why virtual nodes? You no longer have to calculate and assign tokens to each node. Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster. Rebuilding a dead node is faster because it involves every other node in the cluster. Improves the use of heterogeneous machines in a cluster. You can assign a proportional number of vnodes to smaller and larger machines. How did this help? We no longer had issues with cluster balancing. The deviation in storage use was minimal with all nodes showing equal utilization. It’s worth noting that some of the larger Cassandra clusters deployed today are not using virtual nodes.
  23. Or “Why does my code time out in the development environment, but not staging or production?” Another lesson learned involves the hardware used to run C*. In our case the staging and production environments were on dedicated hardware, but the QA and development clusters were provisioned as VMs.
  24. This isn’t necessarily an issue. VMs can be quite performant and there are some people running large C* clusters on virtual platforms. In our case these machines did not have their C* storage on local disks. Instead the mount points were provisioned on a NAS. As we were working to develop features, exceptions kept getting raised in the development and QA clusters that were not manifesting locally. Our developers spent time chasing bugs that only existed in these two environments.
  25. Now let’s move on to our next section, Spark.
  26. To understand where Spark fits in to the project, let’s look back at the data layer and how data enters the system. We have two sets of data ingestion jobs on the cluster.
  27. Did it work? Technically, yes. Why change it? It didn’t meet the SLA. Even with a fairly large number of processes running we couldn’t meet the re-ingestion SLA requirements. How could we make it better? There are two possible approaches. First, optimize the C2S process: add caching and multi-thread where possible. We ended up doing this. It met the SLA, but just barely. We asked ourselves, “What happens when the dataset increases?” Second, look for a new way to ingest the data.
  28. In the new approach the job is submitted to the Spark cluster. Joined data is loaded into an RDD. The RDD is mapped into Solr documents. Solr documents are batched and pushed to SolrCloud. Q: How did this work? A: Not too well. It was a little faster than the original process, but not by much. There was no major load on the Solr cluster; the bottleneck was definitely within the Spark job. How did we move forward?
  29. Why is my job running so slow? Metrics, Metrics, Metrics
  30. We ran the job with metrics enabled: we instrumented every method call with timings and collated the results when the job completed. This painted a pretty clear picture of where we were spinning our wheels.
  31. In our code we create an RDD (Resilient Distributed Dataset) with a SQL query to C*. Basically, our SQL call is translated into multiple C* queries whose results are then mapped, joined together, and returned as a single set of rows. The majority of our work was being done in a foreach on the joined RDD. Each iteration within the foreach loop would connect, send the document, then continue.
  32. The logic which created a connection to the SolrCloud cluster was a huge drain on time. The creation of the HTTP client took 4 times longer than any other part of the job.
  33. Obviously the best solution here would be to use a single connection for sending the documents over. This led us to our next lesson…
  34. “What’s up with this object not being available in my loop?”
  35. We’ve determined that this should work. We finish writing the job, package it up, and ship it off to the cluster. What do we get?
  36. What?! This was supposed to make everything better! The problem here is that the function being executed is packed up and shipped off to the executors. In this case the executors don’t have the Solr connection available.
  37. We’ve determined that this should work. We finish writing the job and hit build. We’re met with an error?! From here we started digging and ended up…
  38. “What do you mean that method doesn’t exist? IT’S IN THE DOCUMENTATION!” Really knowing our API.
  39. What now? Our jobs were written in Java against the Java API with version 0.9.2 of Spark. Scanning the docs revealed the much needed foreachPartition method on RDD, but what we didn’t see was that it’s only available to Scala. When in the land of Java, like we were, we receive a JavaRDD back from the Cassandra SQL call, not an RDD. Now our magical method that should fix everything isn’t here. What’s next?
  40. How about this code? Did it work? Nope, the job ran and never executed the mapPartitions call. In Spark an RDD may have a transformation or action applied to it.
  41. This allows better performance as the entire mapped dataset isn’t returned to the driver, instead the results of the action (usually a smaller value) are returned.
  42. Now with an action, the transformation mapPartitions gets applied, but our driver crashes with an out of memory exception.
  43. Lesson! In our previous bit of code we hit an out of memory exception. This is because all of the rows from the map are being instantiated and returned to the driver. In our case we really don’t need the data.
  44. Here we have instead decided to just return the number of rows processed per partition. These values may be used to provide metrics and keep a tally on the number of records processed while keeping memory usage low.
  45. The new Spark-based process was well within the SLA. Provided additional admin features and … 5x decrease in ingestion time.
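Note 12 says that an index table keyed on rank alone would silently overwrite rows, while PRIMARY KEY ((rank), type) keeps them apart. The sketch below is a plain Python model of that behaviour, not Cassandra itself: a table is treated as a dict keyed by the full primary key, so a write to an existing key replaces the row. The `upsert_all` helper and the second sample row (a first name sharing rank 59290) are hypothetical and exist only for illustration.

```python
from typing import Dict, Iterable, Tuple

Row = Dict[str, object]


def upsert_all(rows: Iterable[Row], key_columns: Tuple[str, ...]) -> Dict[tuple, Row]:
    """Model a table as a dict keyed by the full primary key.

    Writing a row whose key already exists replaces the old row,
    which is how an INSERT behaves as an upsert.
    """
    table: Dict[tuple, Row] = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        table[key] = row
    return table


rows = [
    {"type": "last", "name": "VRANES", "rank": 59290},
    # Hypothetical row: a first name that happens to share the same rank.
    {"type": "first", "name": "ZELDA", "rank": 59290},
]

# Keyed on rank only: the second write replaces the first.
by_rank_only = upsert_all(rows, ("rank",))

# Keyed on (rank, type): both rows survive.
by_rank_and_type = upsert_all(rows, ("rank", "type"))

print(len(by_rank_only), len(by_rank_and_type))
```

With rank as the only key column the last name is lost; adding type to the key keeps one row per (rank, type) pair, which is the behaviour the note asks for.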
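Notes 19 and 20 describe moving from a partition key of year to one of (year, month). As a rough illustration of why that helps, the following Python sketch counts how many rows land in each partition under both keys. The dates are made-up sample data (two filings per month across 2013 and 2014), and `partition_sizes` and `buckets_for_year` are hypothetical helper names, not part of any Cassandra driver.

```python
from collections import Counter
from datetime import date
from typing import Callable, Iterable, List, Tuple


def partition_sizes(dates: Iterable[date], key: Callable[[date], tuple]) -> Counter:
    """Count how many rows fall into each partition for a given key function."""
    return Counter(key(d) for d in dates)


def buckets_for_year(year: int) -> List[Tuple[int, int]]:
    """All (year, month) partition keys a reader must query to cover one year."""
    return [(year, month) for month in range(1, 13)]


# Made-up sample data: two rows per month for 2013 and 2014.
dates = [
    date(year, month, day)
    for year in (2013, 2014)
    for month in range(1, 13)
    for day in (1, 15)
]

by_year = partition_sizes(dates, lambda d: (d.year,))
by_month = partition_sizes(dates, lambda d: (d.year, d.month))

print(max(by_year.values()), max(by_month.values()))
```

The trade-off from note 20 shows up in `buckets_for_year`: reading a whole year now means touching twelve partitions, whether through an IN clause or through separate asynchronous queries.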
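Notes 31 through 44 walk through three ideas: creating a client per record is expensive, a transformation does nothing until an action consumes it, and returning a small count per partition keeps the driver's memory use low. The sketch below models those ideas in plain Python with generators. It does not use Spark or SolrJ; `FakeSolrClient`, `index_per_record`, `index_partition`, and `map_partitions` are hypothetical stand-ins chosen for this illustration.

```python
from typing import Callable, Iterable, Iterator, List


class FakeSolrClient:
    """Hypothetical stand-in for an HTTP client; counts how many get built."""

    created = 0

    def __init__(self) -> None:
        FakeSolrClient.created += 1
        self.sent = 0

    def send(self, doc: str) -> None:
        self.sent += 1


def index_per_record(partitions: Iterable[List[str]]) -> None:
    """The slow pattern: build a new client for every document."""
    for part in partitions:
        for doc in part:
            FakeSolrClient().send(doc)


def index_partition(part: List[str]) -> int:
    """Build one client per partition and return only the row count."""
    client = FakeSolrClient()
    for doc in part:
        client.send(doc)
    return client.sent


def map_partitions(
    partitions: Iterable[List[str]], fn: Callable[[List[str]], int]
) -> Iterator[int]:
    """Lazy, like a transformation: fn is not called until results are consumed."""
    return (fn(part) for part in partitions)


partitions = [["d1", "d2", "d3"], ["d4", "d5"], ["d6"]]

FakeSolrClient.created = 0
index_per_record(partitions)
clients_per_record = FakeSolrClient.created

FakeSolrClient.created = 0
lazy_counts = map_partitions(partitions, index_partition)
clients_before_action = FakeSolrClient.created  # nothing has run yet
total_indexed = sum(lazy_counts)  # plays the role of the action
clients_per_partition = FakeSolrClient.created

print(clients_per_record, clients_before_action, clients_per_partition, total_indexed)
```

In this model the per-record version builds one client per document, while the per-partition version builds one per partition and only does so once `sum` consumes the generator. Only small integers travel back to the caller, which mirrors the design choice in note 44.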