The document discusses different data storage options for small, medium, and large datasets. It argues that relational databases do not scale well for large datasets due to limitations with replication, normalization, sharding, and high availability. The document then introduces Apache Cassandra as a fast, distributed, highly available, and linearly scalable database that addresses these limitations through its use of a hash ring architecture and tunable consistency levels. It describes Cassandra's key features including replication, compaction, and multi-datacenter support.
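The hash ring and tunable consistency mentioned above can be illustrated with a small sketch. This is not Cassandra's implementation: the `HashRing` class, the node names, the single token per node, and the use of MD5 as the hash are all assumptions made for the example. It shows how a key might be mapped to a ring position, how replicas might be picked by walking clockwise, and how the read and write replica counts determine whether a read is guaranteed to overlap the latest write.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a ring position (MD5 is an illustrative choice only)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


class HashRing:
    """Hypothetical ring with one token per node; names are illustrative."""

    def __init__(self, nodes, replication_factor=3):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds node count")
        self.rf = replication_factor
        # Sort nodes by their position on the ring.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def replicas(self, key: str):
        """Return the rf nodes found by walking clockwise from the key's token."""
        start = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return [self._ring[(start + i) % len(self._ring)][1]
                for i in range(self.rf)]


def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(read_count: int, write_count: int, rf: int) -> bool:
    """Reads overlap the latest write when R + W > RF."""
    return read_count + write_count > rf


ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replication_factor=3)
owners = ring.replicas("user:42")
```

With a replication factor of 3, quorum reads plus quorum writes (2 + 2 > 3) always overlap on at least one replica, while reading and writing at a single replica (1 + 1) does not. That is the trade-off tunable consistency exposes.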
Cassandra @ Sony: The good, the bad, and the ugly part 1 | DataStax Academy
This talk covers scaling Cassandra to a fast-growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the PlayStation community.
Cassandra @ Sony: The good, the bad, and the ugly part 2 | DataStax Academy
This talk covers scaling Cassandra to a fast-growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the PlayStation community.
Cassandra Community Webinar: Apache Spark Analytics at The Weather Channel - ... | DataStax Academy
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. But there are serious advantages to many of the new tools, and this presentation will give an analysis of the current state–including pros and cons as well as what’s needed to bootstrap and operate the various options.
About Robbie Strickland, Software Development Manager at The Weather Channel
Robbie works for The Weather Channel’s digital division as part of the team that builds backend services for weather.com and the TWC mobile apps. He has been involved in the Cassandra project since 2010 and has contributed in a variety of ways over the years; this includes work on drivers for Scala and C#, the Hadoop integration, heading up the Atlanta Cassandra Users Group, and answering lots of Stack Overflow questions.
Cassandra Summit 2014: Apache Cassandra Best Practices at eBay | DataStax Academy
Presenter: Feng Qu, Principal DBA at eBay
Cassandra has been adopted widely at eBay in recent years and is used by many end-user-facing applications. I will introduce best practices we have built over time around system design, capacity planning, deployment automation, monitoring integration, performance analysis and troubleshooting. I will also share our experience working with DataStax support to provide a highly available, highly scalable data store fitting into eBay infrastructure.
C* Summit 2013: Searching for a Needle in a Big Data Haystack by Jason Ruther... | DataStax Academy
The presentation demonstrates how Solr may be used to create real-time analytics applications. In addition, DataStax Enterprise 3.0 will be showcased, which offers Solr version 4.0 with a number of improvements over the previous DSE release. A real-time financial application will be run for the audience, followed by a detailed look at how the application was built. An overview of DataStax Enterprise Solr features will be given, along with how the many enhancements in DSE make it unique in the marketplace.
DataStax recently announced the general availability of DataStax Enterprise 4.7 (DSE 4.7), the leading database platform purpose-built for the performance and availability demands of web, mobile, and IoT applications. In this product launch webinar, Robin Schumacher, VP of Products, explores the wide range of enhancements in DSE 4.7 including enterprise-class search, analytics, and in-memory.
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar | DataStax Academy
We have seen rapid adoption of C* at eBay in the past two years. We have made tremendous efforts to integrate C* into existing database platforms, including Oracle, MySQL, Postgres, MongoDB, XMP etc. We also scaled C* to meet business requirements and encountered technical challenges you only see at eBay scale: 100TB of data on hundreds of nodes. We will share our experience of deployment automation, managing, monitoring, and reporting for both Apache Cassandra and DataStax Enterprise.
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra | DataStax Academy
DataStax Enterprise Advanced Replication supports one-way distributed data replication from remote database clusters that might experience periods of network or internet downtime, benefiting use cases that require a 'hub and spoke' architecture.
Learn more at http://www.datastax.com/2016/07/stay-100-connected-with-dse-advanced-replication
Advanced Replication docs – https://docs.datastax.com/en/latest-dse/datastax_enterprise/advRep/advRepTOC.html
Cassandra Community Webinar: From Mongo to Cassandra, Architectural Lessons | DataStax
We'll be covering some aspects of our architecture, highlighting differences between MongoDB and Cassandra. We'll go in depth to explain why Cassandra is a better choice for our general purpose Application Platform (SHIFT) as well as our Media Buying Analytics tool (the SHIFT Media Manager). We'll be going over common design patterns people might be familiar with coming from a background with MongoDB and highlight how Cassandra would be used as a better alternative. We'll also touch more on cqlengine which is nearing feature completeness as the Cassandra object mapper for Python.
Scylla Summit 2018: Scalable Stream Processing with KSQL, Kafka and ScyllaDB | ScyllaDB
The increasing demand to manage real-time data (RTD) resulted in growing adoption of stream processing systems. Organizations can no longer wait for nightly batch jobs to process data and then take actions. In this talk we show how the powerful combination of KSQL, Kafka and ScyllaDB can help you implement scalable stream processing applications. We present a real-time streaming pipeline where massive amounts of data are ingested into Kafka, then processed by KSQL to keep the real-time results in Scylla tables. Whenever you query the Scylla tables you are sure you have the latest results at your fingertips.
Azure + DataStax Enterprise Powers Office 365 Per User Store | DataStax Academy
We will present our O365 use case scenarios, why we chose Cassandra + Spark, and walk through the architecture we chose for running DataStax Enterprise on Azure.
Many NoSQL DBaaS vendors limit what cloud platform you can run on, the size of the data you can run and require you to over-provision cloud infrastructure resources while failing to deliver performance and low latency at scale.
In this session, we will compare the performance and Total Cost of Ownership (TCO) of competing NoSQL DBaaS offerings. We will also review how to migrate to Scylla Cloud, our fully managed database service.
You will learn:
- The true cost of ownership for selected NoSQL DBaaS offerings
- The 8 essentials for selecting a NoSQL DBaaS
- Migration options from Apache Cassandra, DynamoDB and other databases
Scylla Summit 2018: Joining Billions of Rows in Seconds with One Database Ins... | ScyllaDB
Many organizations struggle to balance traditional big data infrastructure with NoSQL databases. Other organizations do the smart thing and consolidate the two. This presentation explores Numberly’s experience migrating an intensive and join-hungry production workload from MongoDB and Hive to Scylla. Using Scylla, we were able to accommodate a join of billions of rows in seconds, while also dramatically reducing operational and development complexity by using a single database for our hybrid analytical use case. As a bonus, we’ll cover benchmarks for Dask (a flexible parallel computing library for analytic computing) and Spark, highlighting their differences and lessons learned along the way.
Dyn delivers exceptional Internet Performance. Enabling high-quality services requires data centers around the globe. In order to manage services, customers need timely insight collected from all over the world. Dyn uses DataStax Enterprise (DSE) to deploy complex clusters across multiple datacenters to enable sub-50 ms query responses for hundreds of billions of data points. From granular DNS traffic data to aggregated counts for a variety of report dimensions, DSE at Dyn has been up since 2013 and has shined through upgrades, data center migrations, DDoS attacks and hardware failures. In this webinar, Principal Engineers Tim Chadwick and Rick Bross cover the requirements which led them to choose DSE as their go-to Big Data solution, the path which led to Spark, and the lessons that they’ve learned in the process.
Cassandra at eBay - Cassandra Summit 2013 | Jay Patel
"Buy It Now! Cassandra at eBay" talk at Cassandra Summit 2013
This session will cover various use cases for Cassandra at eBay. It’ll start with an overview of eBay’s heterogeneous data platform, comprising SQL & NoSQL databases, and where Cassandra fits into that. For each use case, Jay will go into detail on system design, data model & multi-datacenter deployment. To conclude, Jay will summarize the best practices that guide Cassandra utilization at eBay.
http://www.datastax.com/company/news-and-events/events/cassandrasummit2013
Scylla Summit 2016: Why Kenshoo is about to displace Cassandra with Scylla | ScyllaDB
Kenshoo is a leader in digital marketing with very heavy data usage. Learn about their big data challenges, the tools that they use, and their experience evaluating Scylla.
The Last Pickle: Distributed Tracing from Application to Database | DataStax Academy
Monitoring provides information on system performance; however, tracing is necessary to understand individual request performance. Detailed query tracing has been provided by Cassandra since version 1.2 and is invaluable when diagnosing problems, although knowing what queries to trace and why the application makes them still requires deep technical knowledge. By merging application tracing via Zipkin and Cassandra query tracing, we automate the process and make it easier to identify and resolve problems. In this talk Mick Semb Wever, Team Member at The Last Pickle, will introduce Cassandra query tracing and Zipkin. He will then propose an extension that allows clients to pass a trace identifier through to Cassandra, and a way to integrate Zipkin tracing into Cassandra. Driving all this is the desire to create one tracing view across the entire system.
Pythian: My First 100 days with a Cassandra Cluster | DataStax Academy
Apache Cassandra is a massively scalable open-source NoSQL database, and the amount of data that we create and copy annually is doubling in size every two years and is expected to reach 44 zettabytes, or 44 trillion gigabytes. We can assume that sooner or later a DBA will be handling a Cassandra database in their shop. This beginner/intermediate-level session will take you through my journey as an Oracle DBA during my first 100 days administering a Cassandra cluster, with several demos and all the roadblocks and successes I had along this path.
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque... | Data Con LA
Scylla is a new, open-source NoSQL data store with a novel design optimized for modern hardware, capable of 1.8 million requests per second per node, while providing Apache Cassandra compatibility and scaling properties. While conventional NoSQL databases suffer from latency hiccups, expensive locking, and low throughput due to low processor utilization, the Scylla design is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC. The result is a NoSQL database that delivers an order of magnitude more performance, with less performance tuning needed from the administrator.
With extra performance to work with, NoSQL projects can have more flexibility to focus on other concerns, such as functionality and time to market. Come for the tech details on what Scylla does under the hood, and leave with some ideas on how to do more with NoSQL, faster.
Speaker bio
Don Marti is technical marketing manager for ScyllaDB. He has written for Linux Weekly News, Linux Journal, and other publications. He co-founded the Linux consulting firm Electric Lichen. Don is a strategic advisor for Mozilla, and has previously served as president and vice president of the Silicon Valley Linux Users Group and on the program committees for Uselinux, Codecon, and LinuxWorld Conference and Expo.
Webinar: DataStax Training - Everything you need to become a Cassandra Rockstar | DataStax
Looking to strengthen your expertise of Cassandra and DataStax Enterprise? This DataStax Training Webinar will arm you with the knowledge and hands-on skills to get the most out of your DataStax Enterprise environment. If you’ve already taken a DataStax training, consider this a free refresher. Considering training? Then this is a solid intro for developers and admins on your team.
This webinar will highlight the training curriculum and drill into each of the Cassandra expert-led courses so you can determine what meets your needs. Training topics:
Core Concepts, Skills, and Tools
Operations & Performance Tuning
Data Modeling
Using Apache Solr within DataStax Enterprise
And more!
Cisco: Cassandra adoption on Cisco UCS & OpenStack | DataStax Academy
In this talk we will address how we developed our Cassandra environments utilizing the Cisco UCS OpenStack Platform with the DataStax Enterprise Edition software. In addition, we are utilizing open-source Ceph storage in our infrastructure to optimize performance and reduce costs.
Data modeling is one of the first things to sink your teeth into when trying out a new database. That's why we are going to cover this foundational topic in enough detail for you to get dangerous. Data modeling for relational databases is more than a touch different from the way it's approached with Cassandra. We will address the quintessential query-driven methodology through a couple of different use cases, including working with time series data for IoT. We will also demo a new tool to get you bootstrapped quickly with MovieLens sample data. This talk should give you the basics you need to get serious with Apache Cassandra.
Introduction to DataStax Enterprise Graph Database | DataStax Academy
DataStax Enterprise (DSE) Graph is built to manage, analyze, and search highly connected data. DSE Graph, built on NoSQL Apache Cassandra, delivers continuous uptime along with predictable performance and scale for modern systems dealing with complex and constantly changing data.
Download DataStax Enterprise: Academy.DataStax.com/Download
Start free training for DataStax Enterprise Graph: Academy.DataStax.com/courses/ds332-datastax-enterprise-graph
This is a two part talk in which we'll go over the architecture that enables Apache Cassandra’s linear scalability as well as how DataStax Drivers are able to take full advantage of it to provide developers with nicely designed and speedy clients extendable to the core.
Hear about how Coursera uses Cassandra as the core of its scalable online education platform. I'll discuss the strengths of Cassandra that we leverage, as well as some limitations that you might run into as well in practice.
In the second part of this talk, we'll dive into how best to use the DataStax Java drivers. We'll dig into how the driver is architected, and use this understanding to develop best practices to follow. I'll also share a couple of interesting bugs we've run into at Coursera.
To view the full-length video and tutorial, visit: https://academy.datastax.com/demos/getting-started-graph-databases
Getting Started with Graph Databases contains a brief overview of RDBMS architecture in comparison to graph, basic graph terminology, a real-world use case for graph, and an overview of Gremlin, the standard graph query language found in TinkerPop.
Apache Cassandra is a leading open-source distributed database capable of amazing feats of scale, but its data model requires a bit of planning for it to perform well. Of course, the nature of ad-hoc data exploration and analysis requires that we be able to ask questions we hadn’t planned on asking—and get an answer fast. Enter Apache Spark.
Spark is a distributed computation framework optimized to work in-memory, and heavily influenced by concepts from functional programming languages. It’s exactly what a Cassandra cluster needs to deliver real-time, ad-hoc querying of operational data at scale.
In this talk, we’ll explore Spark and see how it works together with Cassandra to deliver a powerful open-source big data analytic solution.
Most people hear "Spark" and think "Analytics". But the ability of Spark to efficiently distribute and manage a full-table traversal while functionally transforming the data makes it perfectly suited to executing "Big Data" maintenance jobs.
One is the loneliest number
Much, much worse than two
Many of PagerDuty’s mission-critical services are based on Cassandra, and as a result we have built up a lot of operational experience over the past few years. Unfortunately, some of our best lessons have come from sizeable failures in production. One of those failures stemmed from having multiple services share the same Cassandra cluster, which was a major factor in PagerDuty’s largest outage of 2014. This talk will relive that outage, sort through the wreckage, and explain why isolating your Cassandra clusters is a best practice you should adopt.
Cassandra is pretty awesome, sure I am biased, but it rocks. Always on, tuneable consistency and multi-master architecture? Let’s get our web scale on and build a highly available app that never goes down!
Hold on a second. There is one key piece of the puzzle that has a massive impact on your application's availability: the client driver.
In this talk we will go through how best to configure your clients to make the most of failure handling and tuneable consistency in Cassandra.
Intro deck from Cassandra Day Atlanta. Covers the evolution of data storage and analysis, the architecture of Cassandra, the read & write path, and using Cassandra for analytics. By Jon Haddad & Luke Tillman
An introduction to core concepts in Apache Cassandra. We cover the evolution of database architecture as you try to scale a relational database to solve big data problems, and explain how Cassandra handles these problems efficiently.
These are the slides from my talk at Hulu in March 2015 discussing Apache Spark & Cassandra. I cover the evolution of data from a single machine to RDBMS (MySQL is the primary example) to big data systems.
On the Spark side, I covered batch jobs, streaming, Apache Kafka, an introduction to machine learning, clustering, logistic regression and recommendation systems (collaborative filtering).
The talk was recorded and is available on YouTube: https://www.youtube.com/watch?v=_gFgU3phogQ
Caches are used in many layers of applications that we develop today, holding data inside or outside of your runtime environment, or even distributed across multiple platforms in data fabrics. However, considerable performance gains can often be realized by configuring the deployment platform/environment and coding your application to take advantage of the properties of CPU caches.
In this talk, we will explore what CPU caches are, how they work and how to measure your JVM-based application data usage to utilize them for maximum efficiency. We will discuss the future of CPU caches in a many-core world, as well as advancements that will soon arrive such as HP's Memristor.
What Every Developer Should Know About Database Scalability - jbellis
Replication. Partitioning. Relational databases. Bigtable. Dynamo. There is no one-size-fits-all approach to scaling your database, and the CAP theorem proved that there never will be. This talk will explain the advantages and limits of the approaches to scaling traditional relational databases, as well as the tradeoffs made by the designers of newer distributed systems like Cassandra. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7955
Cassandra Summit 2014: Deploying Cassandra for Call of Duty - DataStax Academy
Presenters: Seán O Sullivan, Service Reliability Engineer & Tim Czerniak, Software Engineer at Demonware
This presentation covers the eight-month evaluation process we underwent to migrate some of Call of Duty’s core services from MySQL to Cassandra. We will outline our requirements, the process we followed for the evaluation, decisions we made around our schema, configuration and hardware, and some issues we encountered.
This is an introduction to relational and non-relational databases and how their performance affects scaling a web application.
This is a recording of a guest lecture I gave at the University of Texas School of Information.
In this talk I address the technologies and tools Gowalla (gowalla.com) uses including memcache, redis and cassandra.
Find more on my blog:
http://schneems.com
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn - LinkedIn
Jay Kreps on Project Voldemort Scaling Simple Storage At LinkedIn. This was a presentation made at QCon 2009 and is embedded on LinkedIn's blog - http://blog.linkedin.com/
Modeling, estimating, and predicting Ceph (Linux Foundation - Vault 2015) - Lars Marowsky-Brée
A presentation discussing various aspects that affect performance of Ceph clusters, and how to map, model, and predict their performance.
This lays the groundwork for building a Ceph cluster measurement and benchmark suite that eventually will build up a data corpus on performance characteristics that can be used to answer these key questions:
- How to build a storage system that meets my requirements?
- If I build a system like this, what will its characteristics be?
- If I change XY in my existing system, how will its characteristics change?
1) Apache Cassandra in terms of the CAP theorem
2) What makes Apache Cassandra "Available"?
3) How does Apache Cassandra ensure data consistency?
4) Cassandra advantages and disadvantages
5) Frameworks/libraries to access Apache Cassandra + performance comparison
Cassandra Day London 2015: Introduction to Apache Cassandra and DataStax Ente... - DataStax Academy
Speaker(s): Jon Haddad, Apache Cassandra Evangelist and Luke Tillman, Apache Cassandra Language Evangelist at DataStax
This is a crash course introduction to Cassandra. You'll step away understanding how it's possible to utilize this distributed database to achieve high availability across multiple data centers, scale out as your needs grow, and not be woken up at 3am just because a server failed. We'll cover the basics of data modeling with CQL, and understand how that data is stored on disk. We'll wrap things up by setting up Cassandra locally, so bring your laptops.
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft - DataStax Academy
Companies today are innovating with real-time data to deliver truly amazing customer experiences in the moment. Real-time data management for real-time customer experience is core to staying ahead of competition and driving revenue growth. Join Trays to learn how Comcast is differentiating itself from its own historical reputation with Customer Experience strategies.
You’ve heard all of the hype, but how can SMACK work for you? In this all-star lineup, you will learn how to create a reactive, scaling, resilient and performant data processing powerhouse. Bringing Akka, Kafka and Mesos together provides a foundation to develop and operate an elastically scalable actor system. We will go through the basics of Akka, Kafka and Mesos and then deep dive into putting them together in an end2end (and back again) distributed transaction. Distributed transactions mean producers waiting for one or more consumers to respond. We'll also go through automated ways to induce failures in these systems (using LinkedIn Simoorg) and trace them from start to stop through each component (using Twitter's Zipkin). Finally, you will see how Apache Cassandra and Spark can be combined to add the incredibly scalable storage and data analysis needed in fast data pipelines. With these technologies as a foundation, you have the assurance that scale is never a problem and uptime is the default.
Tales From The Front: An Architecture For Multi-Data Center Scalable Applicat... - DataStax Academy
- Quick review of Cassandra functionality that applies to this use case
- Common Data Center and application architectures for highly available inventory applications, and why they were designed that way
- Cassandra implementations vis-a-vis infrastructure capabilities
The impedance mismatch: compromises made to fit into IT infrastructures designed and implemented with an old mindset
A general rule of thumb talk aimed at late bloomers, managers, directors and architects who have yet to adopt Cassandra.
Covers:
- what not to do.
- operational setup
- data modeling
- performance tuning
- capacity planning
- advanced use cases
Presentation Type: Getting Started: Cassandra for the Relational Developer
2. Small Data
• 100's of MB to low GB, single user
• sed, awk, grep are great
• sqlite
• Limitations:
• bad for multiple concurrent users (file sharing!)
3. Medium Data
• Fits on 1 machine
• RDBMS is fine
• postgres
• mysql
• Supports hundreds of concurrent users
• ACID makes us feel good
• Scales vertically
5. Replication: ACID is a lie
[Diagram: Client, Master, and Slave, labeled "replication lag"]
Consistent results? Nope!
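The stale read on this slide can be sketched in a few lines of Python. This is a toy model, not how any real database replicates: the `Master` and `Slave` classes and the explicit `replicate()` call are hypothetical stand-ins for an asynchronous replication stream.

```python
class Slave:
    """Read replica that only sees writes once they have been shipped to it."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)


class Master:
    """Accepts writes and queues them for asynchronous replication."""

    def __init__(self, slave):
        self.data = {}
        self.slave = slave
        self.pending = []  # writes not yet applied on the slave (the "lag")

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))

    def replicate(self):
        # Hypothetical stand-in for the replication stream catching up.
        for key, value in self.pending:
            self.slave.data[key] = value
        self.pending.clear()


slave = Slave()
master = Master(slave)
master.write("balance", 100)
stale = slave.read("balance")  # read arrives before replication catches up
master.replicate()
fresh = slave.read("balance")  # same read after the lag has passed
```

A client that writes to the master and immediately reads from the slave sees no value at all until the replication step runs, which is the inconsistency the slide is pointing at.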
6. Third Normal Form Doesn't Scale
• Queries are unpredictable
• Users are impatient
• Data must be denormalized
• If data > memory, you = history
• Disk seeks are the worst
(SELECT
CONCAT(city_name,', ',region) value,
latitude,
longitude,
id,
population,
( 3959 * acos( cos( radians($latitude) ) *
cos( radians( latitude ) ) * cos( radians( longitude )
- radians($longitude) ) + sin( radians($latitude) ) *
sin( radians( latitude ) ) ) )
AS distance,
CASE region
WHEN '$region' THEN 1
ELSE 0
END AS region_match
FROM `cities`
$where and foo_count > 5
ORDER BY region_match desc, foo_count desc
limit 0, 11)
UNION
(SELECT
CONCAT(city_name,', ',region) value,
latitude,
longitude,
id,
population,
( 3959 * acos( cos( radians($latitude) ) *
cos( radians( latitude ) ) * cos( radians( longitude )
- radians($longitude) ) + sin( radians($latitude) ) *
sin( radians( latitude ) ) ) )
7. Sharding is a Nightmare
• Data is all over the place
• No more joins
• No more aggregations
• Denormalize all the things
• Querying secondary indexes requires hitting every shard
• Adding shards requires manually moving data
• Schema changes
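Two of the pain points above can be illustrated with a small Python sketch. It assumes a hypothetical routing rule (primary key modulo shard count) and an in-memory list of dictionaries standing in for the shards; neither comes from the slides.

```python
NUM_SHARDS = 4  # assumed shard count, for illustration only

shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(user_id, num_shards=NUM_SHARDS):
    # Hypothetical routing rule: primary key modulo the number of shards.
    return user_id % num_shards


def insert(user_id, city):
    shards[shard_for(user_id)][user_id] = {"id": user_id, "city": city}


def get_by_id(user_id):
    """Primary-key lookup: exactly one shard is touched."""
    return shards[shard_for(user_id)].get(user_id), 1


def find_by_city(city):
    """Lookup on a non-key attribute: every shard has to be scanned."""
    hits, touched = [], 0
    for shard in shards:
        touched += 1
        hits.extend(row for row in shard.values() if row["city"] == city)
    return hits, touched


for uid, city in [(1, "Austin"), (2, "Boston"), (3, "Austin"), (8, "Austin")]:
    insert(uid, city)

row, shards_for_pk = get_by_id(3)
rows, shards_for_city = find_by_city("Austin")

# Going from 4 to 5 shards under modulo routing: count the keys that change home.
moved = sum(1 for k in range(1000) if shard_for(k, 4) != shard_for(k, 5))
```

The lookup by city touches all four shards, and under this naive routing rule most keys would have to be moved when a fifth shard is added.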
8. High Availability... not really
• Master failover… who's responsible?
• Another moving part…
• Bolted on hack
• Multi-DC is a mess
• Downtime is frequent
• Change database settings (innodb buffer pool, etc)
• Drive, power supply failures
• OS updates
9. Summary of Failure
• Scaling is a pain
• ACID is naive at best
• You aren't consistent
• Re-sharding is a manual process
• We're going to denormalize for performance
• High availability is complicated, requires additional operational overhead
10. Lessons Learned
• Consistency is not practical
• So we give it up
• Manual sharding & rebalancing is hard
• So let's build it in
• Every moving part makes systems more complex
• So let's simplify our architecture - no more master / slave
• Scaling up is expensive
• We want commodity hardware
• Scatter / gather no good
• We denormalize for real time query performance
• Goal is to always hit 1 machine
11. What is Apache Cassandra?
• Fast Distributed Database
• High Availability
• Linear Scalability
• Predictable Performance
• No SPOF
• Multi-DC
• Commodity Hardware
• Easy to manage operationally
• Not a drop-in replacement for RDBMS
12. Hash Ring
• No master / slave / replica sets
• No config servers, zookeeper
• Data is partitioned around the ring
• Data is replicated to RF=N servers
• All nodes hold data and can answer queries (both reads & writes)
• Location of data on ring is determined by partition key
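The ring described above can be sketched in Python. This is a simplified illustration, not Cassandra's implementation: it assumes one token per node, uses MD5 purely as a convenient hash, and places replicas on the next nodes clockwise. The `Ring` class and node names are hypothetical.

```python
import bisect
import hashlib


def token(value):
    """Map a string onto the ring as an integer (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    def __init__(self, nodes, rf):
        self.rf = rf
        # One token per node, sorted clockwise around the ring.
        self.tokens = sorted((token(n), n) for n in nodes)

    def replicas(self, partition_key):
        """Walk clockwise from the key's token and collect RF distinct nodes."""
        keys = [t for t, _ in self.tokens]
        start = bisect.bisect_right(keys, token(partition_key)) % len(self.tokens)
        count = min(self.rf, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1] for i in range(count)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], rf=3)
owners = ring.replicas("user:42")
```

Every node can compute `replicas()` locally from the partition key, which is why no master or config server is needed to route a request.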
13. CAP Tradeoffs
• Impossible to be both consistent and highly available during a network partition
• Latency between data centers also makes consistency impractical
• Cassandra chooses Availability & Partition Tolerance over Consistency
14. Replication
• Data is replicated automatically
• You pick number of servers
• Called “replication factor” or RF
• Data is ALWAYS replicated to each replica
• If a machine is down, missing data is replayed via hinted handoff
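A rough Python sketch of the hinted handoff idea follows. The `Replica` and `Coordinator` classes are hypothetical, and the sketch ignores details such as hint expiry and where hints are stored; it only shows a missed write being held and replayed once the replica is back.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}


class Coordinator:
    """Toy coordinator that keeps hints for replicas that are down."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.hints = []  # (target replica, key, value) waiting to be replayed

    def write(self, key, value):
        for replica in self.replicas:
            if replica.up:
                replica.data[key] = value
            else:
                self.hints.append((replica, key, value))

    def replay_hints(self):
        remaining = []
        for replica, key, value in self.hints:
            if replica.up:
                replica.data[key] = value
            else:
                remaining.append((replica, key, value))
        self.hints = remaining


a, b, c = Replica("a"), Replica("b"), Replica("c")
coord = Coordinator([a, b, c])
c.up = False
coord.write("k", "v1")              # c misses this write; a hint is stored
missing_while_down = "k" not in c.data
c.up = True
coord.replay_hints()                # c receives the write it missed
```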
15. Consistency Levels
• Per query consistency
• ALL, QUORUM, ONE
• How many replicas for query to respond OK
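The three levels listed above map to a replica count as follows. This is a small sketch using hypothetical helper functions; QUORUM is taken to mean a majority of the replicas, that is, floor(RF / 2) + 1.

```python
def required_replicas(level, rf):
    """Number of replicas that must respond for a query at the given level."""
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return rf // 2 + 1  # a majority of the replicas
    if level == "ALL":
        return rf
    raise ValueError(f"unknown consistency level: {level}")


def query_succeeds(level, rf, replicas_up):
    """True when enough replicas are reachable to satisfy the level."""
    return replicas_up >= required_replicas(level, rf)


rf = 3
needed = {level: required_replicas(level, rf) for level in ("ONE", "QUORUM", "ALL")}
```

With RF=3 and one replica down, `query_succeeds("QUORUM", 3, 2)` holds while `query_succeeds("ALL", 3, 2)` does not, which is the availability trade-off being tuned per query.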
16. Multi DC
• Typical usage: clients write to local DC, replicates async to other DCs
• Replication factor per keyspace per datacenter
• Datacenters can be physical or logical
18. The Write Path
• Writes are written to any node in the cluster (coordinator)
• Writes are written to commit log, then to memtable
• Every write includes a timestamp
• Memtable flushed to disk periodically (sstable)
• New memtable is created in memory
• Deletes are a special write case, called a “tombstone”
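The steps above can be sketched as a toy single-node model in Python. The `StorageNode` class, the `TOMBSTONE` marker, and the size-based flush threshold are assumptions made for illustration; the slide only says that the memtable is flushed periodically.

```python
import time

TOMBSTONE = object()  # marker value standing in for a delete


class StorageNode:
    def __init__(self, memtable_limit=2):
        self.commit_log = []  # append-only record of every write
        self.memtable = {}    # in memory: key -> (timestamp, value)
        self.sstables = []    # flushed tables, oldest first, never modified
        self.memtable_limit = memtable_limit  # assumed flush threshold

    def write(self, key, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        self.commit_log.append((ts, key, value))  # 1. commit log first
        current = self.memtable.get(key)
        if current is None or ts >= current[0]:   # 2. then the memtable
            self.memtable[key] = (ts, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key, timestamp=None):
        # A delete is just a write of a tombstone.
        self.write(key, TOMBSTONE, timestamp)

    def flush(self):
        # Freeze the memtable as an sstable and start a fresh memtable.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}


node = StorageNode(memtable_limit=2)
node.write("a", 1, timestamp=1)
node.write("b", 2, timestamp=2)  # reaches the limit and triggers a flush
node.delete("a", timestamp=3)    # tombstone lands in the new memtable
```

Note that the delete does not touch the flushed sstable; the old value of "a" stays on disk until a later merge discards it.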
19. Compaction
• sstables are immutable
• updates are written to new sstables
• eventually we have too many files on disk
• Merged through compaction, only latest data is kept based on timestamp
[Diagram: three sstables merged into one sstable]
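The merge rule can be sketched in Python as follows. Each sstable is modeled as a dictionary from key to a `(timestamp, value)` pair, which is an assumption for illustration rather than the on-disk format.

```python
def compact(sstables):
    """Merge sstables into one, keeping the newest (timestamp, value) per key."""
    merged = {}
    for table in sstables:
        for key, (ts, value) in table.items():
            if key not in merged or ts > merged[key][0]:
                merged[key] = (ts, value)
    return merged


oldest = {"a": (1, "x"), "b": (2, "y")}
middle = {"a": (5, "x2")}
newest = {"c": (7, "z"), "b": (3, "y2")}
compacted = compact([oldest, middle, newest])
```

The inputs are never modified; compaction writes a new table, which fits the rule that sstables are immutable.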
20. The Read Path
• Any server may be queried, it acts as the coordinator
• Contacts nodes with the requested key
• On each node, data is pulled from SSTables and merged
• Consistency < ALL performs read repair in background (read_repair_chance)
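A simplified Python sketch of the read path follows. It makes two assumptions that differ from the description above: the coordinator contacts every replica, and the repair happens inline rather than in the background. The function names and the dictionary layout are hypothetical.

```python
def read_from_node(sstables, key):
    """Merge one node's sstables: the newest timestamp for the key wins."""
    versions = [table[key] for table in sstables if key in table]
    return max(versions, key=lambda v: v[0]) if versions else None


def coordinator_read(replicas, key):
    """Ask each replica, return the newest value, and repair stale replicas."""
    answers = {name: read_from_node(tables, key) for name, tables in replicas.items()}
    found = [a for a in answers.values() if a is not None]
    if not found:
        return None
    newest = max(found, key=lambda v: v[0])
    for name, answer in answers.items():
        if answer != newest:
            # Read repair: write the newest version back as a new sstable.
            replicas[name].append({key: newest})
    return newest[1]


replicas = {
    "n1": [{"k": (1, "old")}, {"k": (4, "new")}],
    "n2": [{"k": (1, "old")}],  # missed the update
    "n3": [],                   # never received the key
}
value = coordinator_read(replicas, "k")
```

After the read, the stale replicas hold the newest version as well, so a later read from any single node returns the same answer.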