Stavros Kontopoulos
Software Engineer, MSc
Cassandra at Pollfish.com
What do we do at Pollfish?
• Target mobile users with surveys through our Android/iOS SDK, which is installed via thousands
of mobile apps. Developers benefit from completed surveys; companies can also run a survey
campaign in real time. Now: analytics/ML pipeline…
Why Apache Cassandra?
• Store time series about system events, user activities, survey results and much more.
• Amazing write throughput; we take advantage of idempotent writes with proper conflict resolution.
• Decent read throughput and low latency.
• Integrates with Spark to implement our analytics and insights pipeline.
• A new business model.
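To illustrate why idempotent writes matter, here is a minimal sketch of last-write-wins resolution. It is a simplified model written for this example, not Cassandra's actual code; the key name and values are hypothetical.

```python
# Simplified model of last-write-wins cell resolution (illustration only).
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, str]  # (write timestamp in microseconds, value)


def apply_write(table: Dict[str, Cell], key: str, timestamp: int, value: str) -> None:
    """Keep the cell with the highest timestamp; ties fall back to comparing values."""
    candidate = (timestamp, value)
    current = table.get(key)
    if current is None or candidate > current:
        table[key] = candidate


def replay(writes: Iterable[Tuple[str, int, str]]) -> Dict[str, Cell]:
    """Apply a sequence of (key, timestamp, value) writes to an empty table."""
    table: Dict[str, Cell] = {}
    for key, timestamp, value in writes:
        apply_write(table, key, timestamp, value)
    return table


writes = [
    ("survey:42:status", 1000, "started"),
    ("survey:42:status", 2000, "completed"),
]

# Retrying every write, even out of order, converges to the same state.
once = replay(writes)
retried = replay(writes + list(reversed(writes)))
print(once == retried)
```

Because the outcome depends only on the timestamps carried by the writes, a client can retry a failed or timed-out write without corrupting the stored value.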
A flashback… why I am here…
Our Tech Team: discipline-oriented
• Front-end development, UI design
• Back-end, data engineer(s)
• Data scientist
• DevOps
Pollfish High Level Architecture
Mobile Users (~600K active per day)
APP SERVER 1
APP SERVER N
Other systems
(PostgreSQL, Geo, Redis…)
DSE
Cassandra/Spark Cluster
Survey Customers
Why DataStax
Startup program and support when you go to production.
Many tools for development and maintenance: Pig, Hive, Shark, CFS, OpsCenter.
Clear setup and support for real-time data storage and analytics in the same cluster.
Can be extended for other workloads like Search (Solr).
Supports multiple DCs easily for other purposes: staging, backup, etc.
Product versions we used: DSE 4.5.1 (Spark 0.9.1), 4.5.3, 4.6.0 (Spark 1.1.0)
Our Cassandra Cluster- Setup
2 datacenters (Cassandra and Analytics):
• 1 for real-time data storage, read/write path.
• 1 for analytics nodes (Spark and Hadoop enabled): read/write path for ETL and machine learning.
• Use the DSE setup with DSESimpleSnitch for mixed-workload, multi-DC clusters.
3 nodes per DC (details next), planning for more.
Data is written to the Cassandra DC and replicated to the Analytics DC for every keyspace that needs it.
2 seed nodes, one per DC.
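Replication to both datacenters is configured per keyspace. The sketch below builds a CQL statement that would do this with NetworkTopologyStrategy. The keyspace name `events` and the replication factor of 3 per DC are assumptions made for illustration (the slides do not state them), and the datacenter names are assumed to match the two DCs named above.

```python
def create_keyspace_cql(keyspace: str, replicas_per_dc: dict) -> str:
    """Build a CREATE KEYSPACE statement that replicates to each listed datacenter."""
    parts = ["'class': 'NetworkTopologyStrategy'"]
    # One entry per datacenter: name -> number of replicas kept in that DC.
    parts += [f"'{dc}': {rf}" for dc, rf in replicas_per_dc.items()]
    return (
        "CREATE KEYSPACE IF NOT EXISTS " + keyspace
        + " WITH replication = {" + ", ".join(parts) + "};"
    )


# Hypothetical keyspace and replication factors, for illustration only.
statement = create_keyspace_cql("events", {"Cassandra": 3, "Analytics": 3})
print(statement)
```

A keyspace that analytics never reads could simply omit the Analytics entry, so its data would stay in the real-time DC only.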
Our Cassandra Cluster- Infrastructure Details
Cassandra DC:
• Nodes 0, 1, 2: Standard_A4 (8 cores, 14 GB memory), 15 disks in RAID 0
Analytics DC:
• Nodes 0, 1, 2: Standard_D13 (8 cores, 56 GB memory), 15 disks in RAID 0
A bit about disks (ideally JBODs, but we can only have network disks)…
	 FS type: ext4
	 Analytics disks per node:
Azure OS disk (/dev/sda1, mp: /): 29 GB
Azure VM tmp disk (/dev/sdb1, mp: /mnt/resource): 400 GB - for Spark temp files…
data_file_directories: /mnt/dsedata/lib/cassandra/data (2.9 TB)
commitlog_directory: /var/lib/cassandra/commitlog (197 GB disk)
A few tips towards production…
As a starting point, follow:
http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
http://www.datastax.com/wp-content/uploads/2014/04/WP-DataStax-Enterprise-Best-Practices.pdf
• NTP is a must on all servers: Cassandra relies on synchronized clocks to order writes across
nodes, and it matters when you run analytics too.
• Proper limits:
		 cassandra - memlock unlimited
		 cassandra - nofile 100000
		 cassandra - nproc 32768
		 cassandra - as unlimited
• Optimal blockdev --setra setting for RAID: read-ahead (RA) must be set to 128 KB on every device.
• No swap
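One detail worth spelling out for the read-ahead tip: blockdev --setra takes its value in 512-byte sectors, not kilobytes. The sketch below converts a target size into the argument to pass; the device path /dev/md0 is a hypothetical RAID device used only for illustration.

```python
SECTOR_BYTES = 512  # blockdev --setra counts 512-byte sectors


def readahead_sectors(kilobytes: int) -> int:
    """Convert a read-ahead size in KB to the sector count blockdev expects."""
    return kilobytes * 1024 // SECTOR_BYTES


def setra_command(device: str, kilobytes: int) -> list:
    """Build the blockdev invocation as an argv list; this does not run it."""
    return ["blockdev", "--setra", str(readahead_sectors(kilobytes)), device]


# /dev/md0 is a hypothetical device name.
command = setra_command("/dev/md0", 128)
print(" ".join(command))
```

For a 128 KB target this yields 256 sectors; running blockdev --getra on the device afterwards reports the value in the same unit.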
A few Cassandra optimization tips
For Cassandra DC:
	 • concurrent_reads: 64
	 • concurrent_writes: 32
	 • Adjust according to the load and node IOPS/throughput.
Heap size adjustments: DataStax does a good job of handling this automatically. In general, an
8 GB heap is fine for production and avoids the long GC pauses found in bigger JVMs. We use 4 GB on
the smaller Cassandra nodes and 8 GB on the Analytics nodes.
	 • If your load creates many short-lived objects, adjust the ParNew (young generation) size accordingly.
	 Your tools: jconsole, jvisualvm, hprof, jstack; enable GC reporting in the logs…
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC"
…
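As a rough check on the heap sizes mentioned above, the sketch below reproduces the automatic sizing rule described in the comments of the stock cassandra-env.sh of that era: heap = max(min(1/2 RAM, 1024 MB), min(1/4 RAM, 8192 MB)), and young generation = min(100 MB per core, 1/4 heap). Treat the formula as an assumption if your version differs; the memory figures are the nominal VM sizes, so values measured on a real node will be slightly lower.

```python
def auto_heap_mb(system_memory_mb: int) -> int:
    """Default max heap: max(min(1/2 RAM, 1 GB), min(1/4 RAM, 8 GB)), in MB."""
    half_capped = min(system_memory_mb // 2, 1024)
    quarter_capped = min(system_memory_mb // 4, 8192)
    return max(half_capped, quarter_capped)


def auto_new_size_mb(heap_mb: int, cores: int) -> int:
    """Default young generation: min(100 MB per core, 1/4 of the heap), in MB."""
    return min(100 * cores, heap_mb // 4)


# Nominal sizes of the two VM types used in the cluster (8 cores each).
for name, ram_gb in (("Standard_A4", 14), ("Standard_D13", 56)):
    heap = auto_heap_mb(ram_gb * 1024)
    print(name, "heap:", heap, "MB", "new size:", auto_new_size_mb(heap, 8), "MB")
```

Under these assumptions the rule gives about 3.5 GB for the 14 GB nodes and the 8 GB cap for the 56 GB nodes, which is in line with the 4 GB and 8 GB figures quoted above.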
Managing nodes
How do you change a tire while the car is running?
Adding a new, clean node to the cluster fast:
• Add the following line to /etc/dse/cassandra/cassandra-env.sh:
• JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"
• Start the service.
• Revert the change and execute nodetool join.
• After joining finishes, execute a node rebalance from OpsCenter.
• Fallback: remove the data and commit log directories and start again.
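The "add the line, start, revert" part of this procedure is easy to get wrong by hand. A minimal sketch of the two file edits, operating on the text of cassandra-env.sh, could look like this. The helper names are hypothetical, and the sample file content is made up for illustration.

```python
JOIN_RING_LINE = 'JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"'


def disable_join_ring(env_text: str) -> str:
    """Append the join_ring=false option unless it is already present."""
    lines = env_text.splitlines()
    if JOIN_RING_LINE not in lines:
        lines.append(JOIN_RING_LINE)
    return "\n".join(lines) + "\n"


def restore_join_ring(env_text: str) -> str:
    """Remove the option again so later restarts join the ring normally."""
    lines = [line for line in env_text.splitlines() if line != JOIN_RING_LINE]
    return "\n".join(lines) + "\n"


# Made-up file content standing in for /etc/dse/cassandra/cassandra-env.sh.
original = 'MAX_HEAP_SIZE="4G"\n'
patched = disable_join_ring(original)
print(patched, end="")
print(restore_join_ring(patched) == original)
```

Applying the first helper twice leaves the file unchanged, so a re-run of the procedure does not stack duplicate options.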
Stopping a node:
nodetool -h <hostname> drain
sudo service dse stop
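The order matters here: drain first so the node flushes its memtables and stops accepting writes, then stop the service. A small sketch that encodes the sequence and defaults to a dry run is shown below; the host address is hypothetical, and actually executing the commands requires a node with nodetool and the dse service installed.

```python
import subprocess


def stop_node_commands(host: str) -> list:
    """Return the ordered commands for a graceful stop: drain, then stop DSE."""
    return [
        ["nodetool", "-h", host, "drain"],
        ["sudo", "service", "dse", "stop"],
    ]


def run(commands: list, dry_run: bool = True) -> list:
    """Run each command in order; with dry_run=True only report what would run."""
    executed = []
    for argv in commands:
        executed.append(" ".join(argv))
        if not dry_run:
            # check=True aborts the sequence if a step fails.
            subprocess.run(argv, check=True)
    return executed


# 10.0.0.5 is a hypothetical node address.
plan = run(stop_node_commands("10.0.0.5"))
print("\n".join(plan))
```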
Integration with Spark- Use cases
Common ETL case:
Mobile User profile (ML):
HDFS Compatible (CFS)
Cassandra Raw Data
Spark/Cassandra
Spark/Cassandra
HDFS Compatible (CFS)
Cassandra User profile data
Customers Amazon S3
Integration with Spark- Our job framework
• We have built an analytics framework to run Spark jobs on top of Cassandra.
It executes jobs remotely through Maven.
• The SparkContext is created on a non-DSE node. This needed some tricks to work; we are now moving
to the more official DSE approach with spark-submit. Note: we started with DSE 4.5.1 (Spark 0.9.1),
where spark-submit was not available.
• The framework provides an API to run jobs in production and shares common functionality with the
data science side for code reuse.
DataStax Cassandra OpsCenter
• Facilitates monitoring, health checks and daily operations at the cluster level.
• We use it for node rebalancing, repairs, snapshotting, and latency/throughput checks, along with
low-level tools like iotop, ioping, etc.
DataStax Cassandra OpsCenter (Pending cluster operations)
DataStax Cassandra OpsCenter (Keyspace Details)
DataStax Cassandra OpsCenter (Last 10 days)
The Good and the Bad with Cassandra and DSE
Good
• Easy to start with, enough documentation.
• Reliable performance so far.
• Easily scalable, easy to add/remove nodes.
• Support.
Bad
• Bugs at the Cassandra level and the Spark level. You need to actively follow the mailing lists
for Spark, Cassandra and the DSE tools. An upgrade may be the only solution, and it is not a
piece of cake.
• Gets tricky to optimize depending on your load, especially when you don't have time to measure
everything upfront and you are already in production.
The Good and the Bad with Cassandra and DSE (Some issue examples)
Apply backpressure gently when overloaded with writes,
https://issues.apache.org/jira/browse/CASSANDRA-7937
Reproduce by writing a large volume of data with large column values (e.g. 300K), as in a crawling scenario...
Spark Cassandra Connector issues (e.g. the Java driver not closing its connection to the cluster)...
As a developer, some cool stuff I mess with
Currently I am extending the framework I have built to deliver Spark jobs on top of Cassandra.
• Integration tests (IT) with the Cassandra embedded connector.
• Cassandra Schema Design.
• ETL use cases.
Organizing dev env:
Optimizing the development environment with Vagrant: automatic installation to avoid upgrade
headaches, setup, and IT-level testing. Introduced CI with Jenkins, Artifactory, etc…
thank you

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Cassandra at Pollfish

  • 1. Stavros Kontopoulos Software Engineer, MSc Cassandra at Pollfish.com
  • 2.
  • 3. What we do at Pollfish? • Target mobile users with surveys through our Android/iOS SDK, which is installed via thousands of mobile apps. Developers benefit from completed surveys; companies can also run a survey campaign in real time. Now analytics/ML pipeline… Why Apache Cassandra? • Store time series about system events, user activities, survey results and much more. • Amazing write throughput; take advantage of idempotent writes with proper resolution. • Decent read throughput and low latency. • Integrates with Spark to implement our analytics and insights pipeline. • A new business model. A flashback… why I am here…
  • 4. Our Tech Team- Disciplinary oriented • Front End Development - UI Design • Back-end, data engineer(s) • Data Scientist • DevOps
  • 5. Pollfish High Level Architecture Mobile Users (~600K active per day) APP SERVER 1 APP SERVER N Other systems (Postgres, Geo, Redis…) DSE Cassandra/Spark Cluster Survey Customers
  • 6. Why DataStax? Startup program and support when you go to production. Many tools for development and maintenance: Pig, Hive, Shark, CFS, OpsCenter. Clear setup and support for real-time data storage and analytics in the same cluster. Can be extended for other workloads like Search (Solr). Supports multiple DCs easily for other purposes: staging, backup, etc. Product versions we used: DSE 4.5.1 (Spark 0.9.1), 4.5.3, 4.6.0 (Spark 1.1.0)
  • 7. Our Cassandra Cluster- Setup 2 DataCenters (Cassandra and Analytics): • 1 for real-time data storage, read/write path. • 1 for analytics nodes (Spark and Hadoop enabled), read/write path, ETL, machine learning. • Use the DSE setup with DSESimpleSnitch for mixed-workload, multi-DC clusters. 3 nodes per DC (details next), planning for more. Data is written at the Cassandra DC and replicated to the Analytics DC for all keyspaces needed. 2 seed nodes, one per DC.
  • 10. Our Cassandra Cluster- Infrastructure Details Cassandra DC: • Node 0,1,2: Standard_A4 (8 cores, 14 GB memory), 15 disks in RAID 0 Analytics DC: • Node 0,1,2: Standard_D13 (8 cores, 56 GB memory), 15 disks in RAID 0 A bit about disks (ideally JBODs, but we can only have network disks)… FS type: ext4. Analytics disks per node: Azure OS disk (/dev/sda1, mp: /): 29GB; Azure VM tmp disk (/dev/sdb1, mp: /mnt/resource): 400GB, for Spark temp files; data_file_directories: /mnt/dsedata/lib/cassandra/data (2.9T); commitlog_directory: /var/lib/cassandra/commitlog (197GB disk)
  • 11. A few tips towards production… Follow as a start: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html http://www.datastax.com/wp-content/uploads/2014/04/WP-DataStax-Enterprise-Best-Practices.pdf • NTP is a must for all servers; Cassandra needs it for node data synchronization and when you do analytics. • Proper limits: cassandra - memlock unlimited; cassandra - nofile 100000; cassandra - nproc 32768; cassandra - as unlimited • Optimum blockdev --setra settings for RAID (OK): all values must be set to 128KB for RA. • No swap
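One detail worth checking when applying the read-ahead tip: `blockdev --setra` takes its value in 512-byte sectors, not kilobytes. The sketch below converts a size in KB to sectors and prints (rather than runs) the resulting command. The function names and the device `/dev/md0` are hypothetical, and whether the slide's 128 is meant as KB or as sectors should be confirmed against the DataStax settings page linked in the slide.

```shell
#!/usr/bin/env bash
# blockdev --setra counts in 512-byte sectors.
SECTOR_BYTES=512

# Convert a read-ahead size in KB to the sector count blockdev expects.
kb_to_sectors() {
  local kb="$1"
  echo $(( kb * 1024 / SECTOR_BYTES ))
}

# Print (do not execute) the command for one device.
# The device name passed in is an assumption; use your own RAID device.
readahead_command() {
  local device="$1" kb="$2"
  echo "blockdev --setra $(kb_to_sectors "$kb") $device"
}
```

For example, `readahead_command /dev/md0 128` prints `blockdev --setra 256 /dev/md0`, since 128 KB is 256 sectors.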
  • 12. A few Cassandra optimization tips For Cassandra DC: • concurrent_reads: 64 • concurrent_writes: 32 • Adjust according to the load and node IOPS/throughput. Heap size adjustments: DataStax does a good job of handling this automatically… In general, an 8GB heap is OK for production, to avoid the GC pauses found in bigger JVMs. We use 4GB on the smaller Cassandra nodes and 8GB on the Analytics nodes. • If your load creates many short-lived objects, adjust the ParNew size accordingly. Your tools: jconsole, jvisualvm, hprof, jstack; enable GC reports at logging level… JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails" JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps" JVM_OPTS="$JVM_OPTS -XX:+PrintHeapAtGC" …
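As a rough illustration of the heap tip, the helper below applies two simplified rules of thumb: take a quarter of system RAM but cap the heap at 8 GB, and size the young generation at about 100 MB per CPU core. `MAX_HEAP_SIZE` and `HEAP_NEWSIZE` are the variables that `cassandra-env.sh` reads; the function names and the simplified rules are assumptions made for this sketch, not the exact logic of the stock script and not necessarily how the settings in the slide were chosen.

```shell
#!/usr/bin/env bash
# Sketch only: simplified sizing rules, not the exact logic in cassandra-env.sh.

# Heap in MB: a quarter of system RAM, capped at 8 GB to limit GC pauses.
suggest_heap_mb() {
  local ram_mb="$1"
  local quarter=$(( ram_mb / 4 ))
  if [ "$quarter" -gt 8192 ]; then
    echo 8192
  else
    echo "$quarter"
  fi
}

# Young generation in MB: roughly 100 MB per CPU core.
suggest_newsize_mb() {
  local cores="$1"
  echo $(( cores * 100 ))
}

# Print the two lines one would place in cassandra-env.sh.
print_heap_settings() {
  local ram_mb="$1" cores="$2"
  echo "MAX_HEAP_SIZE=\"$(suggest_heap_mb "$ram_mb")M\""
  echo "HEAP_NEWSIZE=\"$(suggest_newsize_mb "$cores")M\""
}
```

For a 56 GB, 8-core node, `print_heap_settings 57344 8` prints `MAX_HEAP_SIZE="8192M"` and `HEAP_NEWSIZE="800M"`.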
  • 13. Managing nodes When you have a running car, how do you change the tire? Adding a new clean node to the cluster fast: • Add the following line to /etc/dse/cassandra/cassandra-env.sh: • JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false" • Start the service. • Revert the change and execute nodetool join. • After joining finishes, execute a node rebalance from OpsCenter. • Fallback: remove the data and commit log dirs and start again. Stopping a node: nodetool drain -h <hostname>; sudo service dse stop
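The join and stop procedures above can be written down as a dry-run checklist. The sketch prints the steps in order instead of executing anything. The function names and the example host `node0` are hypothetical, and `sudo service dse start` is assumed as the counterpart of the stop command; the file path, JVM flag, and `nodetool` commands are the ones from the slide.

```shell
#!/usr/bin/env bash
# Dry run: every function only prints; nothing touches the cluster.
ENV_FILE=/etc/dse/cassandra/cassandra-env.sh
JOIN_LINE='JVM_OPTS="$JVM_OPTS -Dcassandra.join_ring=false"'

# Steps to bring a clean node into the ring.
print_join_steps() {
  echo "1. append to ${ENV_FILE}: ${JOIN_LINE}"
  echo "2. sudo service dse start"
  echo "3. remove that line from ${ENV_FILE}"
  echo "4. nodetool join"
  echo "5. rebalance the node from OpsCenter"
}

# Steps to stop a node cleanly: flush and stop accepting writes, then stop DSE.
print_stop_steps() {
  local host="$1"
  echo "nodetool -h ${host} drain"
  echo "sudo service dse stop"
}
```

Calling `print_stop_steps node0` prints the drain command followed by the service stop, which is the order that matters: draining first flushes memtables so the commit log does not need replaying on restart.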
  • 14. Integration with Spark- Use cases Common ETL case: Mobile User profile (ML): HDFS Compatible (CFS) Cassandra Raw Data Spark/Cassandra Spark/Cassandra HDFS Compatible (CFS) Cassandra User profile data Customers Amazon S3
  • 15. Integration with Spark- Our job framework • We have built an analytics framework to run Spark jobs on top of Cassandra. It executes jobs remotely through Maven. • The SparkContext is created on a non-DSE node. This needed some tricks to make it work; we are now moving to the more official DSE approach with spark-submit. Note: we started with the DSE version shipping Spark 0.9.1, where no spark-submit was available. • The framework provides an API to run jobs in production and to share common functionality on the data science side for code re-use.
  • 16. DataStax Cassandra OpsCenter • Facilitates monitoring, health checks and daily operations at the cluster level. • We use it for node rebalance, repairs, snapshotting, and latency/throughput checks, along with low-level tools like iotop, ioping, etc.
  • 17. DataStax Cassandra OpsCenter (Pending cluster operations)
  • 18. DataStax Cassandra OpsCenter (Keyspace Details)
  • 20. The Good and the Bad with Cassandra and DSE Good: • Easy to start with, enough documentation. • Reliable performance so far. • Easily scalable, easy to add/remove nodes. • Support. Bad: • Bugs at the Cassandra level and the Spark level. You need to actively follow the lists for Spark, Cassandra and the DSE tools. An upgrade may be the only solution, and it is not a piece of cake. • Gets tricky to optimize depending on your load, specifically when you don't have time to measure everything upfront and you are already in production.
  • 21. The Good and the Bad with Cassandra and DSE (Some issue examples) Apply backpressure gently when overloaded with writes: https://issues.apache.org/jira/browse/CASSANDRA-7937 Reproduce by writing a large volume of data with large column values, e.g. 300K, in a crawling scenario... Spark Cassandra Connector issues (e.g. Java driver not closing the connection with the cluster)...
  • 22. As a developer, some cool stuff I mess with Currently I am extending the framework I have built to deliver Spark jobs on top of Cassandra. • IT with Cassandra embedded connector. • Cassandra schema design. • ETL use cases. Organizing the dev env: optimizing the development environment with Vagrant, automatic installation (avoids upgrade headaches), setup and IT-level testing. Introduced CI with Jenkins, Artifactory, etc…