Cassandra is used for real-time bidding in online advertising. It processes billions of bid requests per day with low latency requirements. Segment data, which assigns product or service affinity to user groups, is stored in Cassandra to reduce calculations and allow users to be bid on sooner. Tuning the cache size and understanding the active dataset helps optimize performance.
This presentation will investigate how using micro-batching for submitting writes to Cassandra can improve throughput and reduce client application CPU load.
Micro-batching combines writes for the same partition key into a single network request and ensures they hit the "fast path" for writes on a Cassandra node.
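To make the grouping idea concrete, here is a minimal sketch in plain Python. It is not the speaker's implementation and it does not call any Cassandra driver: the `micro_batches` function, the `(partition_key, row)` tuple shape and the `max_batch_size` parameter are all assumptions made for illustration. It only shows how pending writes could be grouped by partition key so that each group can be sent as one request.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Iterator, List, Tuple

# Hypothetical shape of a pending write: (partition_key, row).
Write = Tuple[Hashable, dict]


def micro_batches(
    writes: Iterable[Write], max_batch_size: int = 3
) -> Iterator[Tuple[Hashable, List[dict]]]:
    """Group writes by partition key and yield one bounded batch per group."""
    pending: Dict[Hashable, List[dict]] = defaultdict(list)
    for key, row in writes:
        pending[key].append(row)
        # Flush a partition's batch as soon as it reaches the size limit.
        if len(pending[key]) >= max_batch_size:
            yield key, pending.pop(key)
    # Flush whatever is left once the input is exhausted.
    for key, rows in pending.items():
        if rows:
            yield key, rows


writes = [
    ("u1", {"v": 1}),
    ("u2", {"v": 2}),
    ("u1", {"v": 3}),
    ("u1", {"v": 4}),
    ("u2", {"v": 5}),
]
batches = list(micro_batches(writes, max_batch_size=3))
# Five individual writes become two requests, one per partition key.
```

In a real client, each yielded batch would be submitted as a single request for that partition; the batch size limit is a tuning knob assumed here, not a value taken from the talk.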
About the Speaker
Adam Zegelin Technical Co-founder, Instaclustr
As Instaclustr's founding software engineer, Adam provides the foundation knowledge of our capability and engineering environment. He delivers business-focused value to our code-base and overall capability architecture. Adam is also focused on providing Instaclustr's contribution to the broader open source community on which our products and services rely, including Apache Cassandra, Apache Spark and other technologies such as CoreOS and Docker.
Pythian: My First 100 days with a Cassandra Cluster - DataStax Academy
Apache Cassandra is a massively scalable open source NoSQL database, and the amount of data we create and copy annually is doubling every two years and is expected to reach 44 zettabytes, or 44 trillion gigabytes. We can assume that sooner or later a DBA will be handling a Cassandra database in their shop. This beginner/intermediate-level session will take you through my journey as an Oracle DBA and my first 100 days administering a Cassandra cluster, with several demos and all the roadblocks and successes I had along this path.
Webinar: Getting Started with Apache Cassandra - DataStax
Would you like to learn how to use Cassandra but don’t know where to begin? Want to get your feet wet but you’re lost in the desert? Longing for a cluster when you don’t even know how to set up a node? Then look no further! Rebecca Mills, Junior Evangelist at DataStax, will guide you in the webinar “Getting Started with Apache Cassandra...”
You'll get an overview of Planet Cassandra’s resources to get you started quickly and easily. Rebecca will take you down the path that's right for you, whether you are a developer or administrator. Join if you are interested in getting Cassandra up and working in the way that suits you best.
We run multiple DataStax Enterprise clusters in Azure each holding 300 TB+ data to deeply understand Office 365 users. In this talk, we will deep dive into some of the key challenges and takeaways faced in running these clusters reliably over a year. To name a few: process crashes, ephemeral SSDs contributing to data loss, slow streaming between nodes, mutation drops, compaction strategy choices, schema updates when nodes are down and backup/restore. We will briefly talk about our contributions back to Cassandra, and our path forward using network attached disks offered via Azure premium storage.
About the Speaker
Anubhav Kale Sr. Software Engineer, Microsoft
Anubhav is a senior software engineer at Microsoft. His team is responsible for building a big data platform using Cassandra, Spark and Azure to generate per-user insights of Office 365 users.
C* Summit 2013: Cassandra at eBay Scale by Feng Qu and Anurag Jambhekar - DataStax Academy
We have seen rapid adoption of C* at eBay in the past two years. We have made tremendous efforts to integrate C* into existing database platforms, including Oracle, MySQL, Postgres, MongoDB, XMP, etc. We have also scaled C* to meet business requirements and encountered technical challenges you only see at eBay scale: 100TB of data on hundreds of nodes. We will share our experience of deployment automation, managing, monitoring and reporting for both Apache Cassandra and DataStax Enterprise.
Adam Zegelin is Instaclustr's founding software engineer. This presentation will investigate how using micro-batching for submitting writes to Cassandra can improve throughput and reduce client application CPU load. Micro-batching combines writes for the same partition key into a single network request and ensures they hit the “fast path” for writes on a Cassandra node.
At Instagram, our mission is to capture and share the world's moments. Our app is used by over 400M people monthly; this creates a lot of challenging data needs. We use Cassandra heavily, as a general key-value storage. In this presentation, I will talk about how we use Cassandra to serve our critical use cases; the improvements/patches we made to make sure Cassandra can meet our low latency, high scalability requirements; and some pain points we have.
About the Speaker
Dikang Gu Software Engineer, Facebook
I'm a software engineer on the Instagram core infra team, working on scaling Instagram infrastructure, especially on building a generic key-value store based on Cassandra. Prior to this, I worked on the development of HDFS at Facebook. I received a master's degree in Computer Science from Shanghai Jiao Tong University in China.
Webinar: Diagnosing Apache Cassandra Problems in Production - DataStax Academy
This session covers diagnosing and solving common problems encountered in production, using performance profiling tools. We’ll also give a crash course in basic JVM garbage collection tuning. Viewers will leave with a better understanding of what they should look for when they encounter problems with their in-production Cassandra cluster.
Clock Skew and Other Annoying Realities in Distributed Systems (Donny Nadolny... - DataStax
You write with QUORUM, you read with QUORUM. You're safe, right?
Although it may seem that way, you could read a different value than the one you wrote - even if nobody else wrote after you. One way this can happen is if the time on the machines in your cluster is not synchronized closely enough. This is called clock skew, and is just one of the ways you'll see that this anomaly can occur.
In this talk we'll dive into how Cassandra handles conflicting data, walk through several weird and seemingly impossible situations that can happen (both with and without clock skew), and see what we can do to work around them.
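The clock skew anomaly described above can be sketched in a few lines of Python. Cassandra resolves conflicting versions of a cell by write timestamp (last write wins); the `Cell` class, the `resolve` and `stamp` helpers and the specific clock offsets below are hypothetical names and numbers chosen for illustration, and timestamp ties are not modeled faithfully. This is not taken from the talk itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    value: str
    timestamp_us: int  # write timestamp in microseconds, as seen by the writer's clock


def resolve(a: Cell, b: Cell) -> Cell:
    """Last-write-wins: keep the cell with the higher timestamp.

    Simplification: on an exact tie this sketch just returns b.
    """
    return a if a.timestamp_us > b.timestamp_us else b


def stamp(true_time_us: int, clock_offset_us: int) -> int:
    """Timestamp produced by a machine whose clock is off by clock_offset_us."""
    return true_time_us + clock_offset_us


# The first write happens at true time 1.0 s on a machine whose clock runs 0.5 s fast.
first = Cell("old", stamp(1_000_000, 500_000))
# The second write happens later, at true time 1.2 s, on a machine with an accurate clock.
second = Cell("new", stamp(1_200_000, 0))

winner = resolve(first, second)
# The earlier write carries the larger timestamp, so it survives reconciliation.
```

Because reconciliation only compares timestamps, the write that happened later in real time is discarded here, regardless of the consistency level used for the read.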
About the Speaker
Donny Nadolny Senior Developer, PagerDuty
Donny Nadolny is a Scala developer at PagerDuty, working on improving the reliability of their backend systems. He spends a large amount of time investigating problems experienced with distributed systems like Cassandra and ZooKeeper.
C* for Deep Learning (Andrew Jefferson, Tracktable) | Cassandra Summit 2016 - DataStax
A deep learning startup has a requirement for a robust and scalable data architecture. Training a Deep Neural Network requires 10s-100s of millions of examples consisting of data and metadata. In addition to training it is necessary to support test/validation, data exploration and more traditional data science analytics workloads. As a startup we have minimal resources and an engineering team of 1.
Cassandra, Spark and Kafka running on Mesos in AWS is a scalable architecture that is fast and easy to set up and maintain to deliver a data architecture for Deep Learning.
About the Speaker
Andrew Jefferson VP Engineering, Tractable
A software engineer specialising in realtime data systems. I've worked at companies from Startups to Apple on applications ranging from Ticketing to Genetics. Currently building data systems for training and exploiting Deep Neural Networks.
Cassandra Summit 2014: Apache Cassandra Best Practices at Ebay - DataStax Academy
Presenter: Feng Qu, Principal DBA at eBay
Cassandra has been adopted widely at eBay in recent years and used by many end-user facing applications. I will introduce best practices we have built over time around system design, capacity planning, deployment automation, monitoring integration, performance analysis and troubleshooting. I will also share our experience working with DataStax support to provide a highly available, highly scalable data store fitting into eBay infrastructure.
What We Learned About Cassandra While Building go90 (Christopher Webster & Th... - DataStax
Go90 is a mobile entertainment platform offering access to live and on demand videos. We built the web services platform and social features like activity feed for go90 by making heavy use of Cassandra and Scala, and would like to share what we learned during development and while operating go90. In this presentation, we cover our data model evolution from the initial prototypes to the current production version and the significant performance gain by using a better data model. We will explain how we apply time series data modeling and the benefits of using expiring columns with DateTieredCompactionStrategy. We will also talk about interesting experiences related to table modifications, tombstones and table pagination. On the operations side, we will discuss our findings on Java driver usage, performance, monitoring, cluster maintenance, version upgrades, 2-way SSL and many more. We hope you can learn from our mistakes instead of making them yourself!
About the Speakers
Christopher Webster Software Engineer, AOL
Christopher Webster works on the web services platform for the go90 AOL project. Previously he was a Computer Scientist for the Mission Control Technologies project at NASA Ames Center. Chris worked as a senior staff engineer at Sun Microsystems for Project zembly, the cloud development and deployment environment as well as technical lead in many NetBeans projects. Chris is an author of the NetBeans Field Guide and Assemble the Social Web With Zembly.
Thomas Ng Software Engineer, AOL
Thomas Ng is a software engineer at AOL, building web services for the go90 mobile entertainment platform using Cassandra, Scala and Kafka.
This presentation will recount the story of Macys.com (and Bloomingdales.com)'s selection and migration from legacy RDBMS to NoSQL Cassandra in partnership with DataStax.
We'll start with a mercifully brief backgrounder on our website and our business. Then we will go over the various technologies that we considered, as well as our use case-based performance benchmarks that led to the decision to go with Cassandra.
We'll cover the various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks.
One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.
And, finally, we will wrap up with our "lessons learned" and a brief look at our future plans.
Apache Cassandra operations have a reputation for being simple on single datacenter deployments and / or low volume clusters, but they become far more complex on high latency multi-datacenter clusters with high volume and / or high throughput: basic Apache Cassandra operations such as repairs, compactions or hints delivery can have dramatic consequences even on a healthy high latency multi-datacenter cluster.
In this presentation, Julien will first go through Apache Cassandra multi-datacenter concepts, then show multi-datacenter operations essentials in detail: bootstrapping new nodes and / or datacenters, repair strategy, Java GC tuning, OS tuning, Apache Cassandra configuration and monitoring.
Based on his 3 years' experience managing a multi-datacenter cluster running Apache Cassandra 2.0, 2.1, 2.2 and 3.0, Julien will give you tips on how to anticipate and prevent / mitigate issues related to basic Apache Cassandra operations with a multi-datacenter cluster.
About the Speaker
Julien Anguenot VP Software Engineering, iland Internet Solutions, Corp
Julien currently serves as iland's Vice President of Software Engineering. Prior to joining iland, Mr. Anguenot held tech leadership positions at several open source content management vendors and tech startups in Europe and in the U.S. Julien is a long-time open source software advocate, contributor and speaker: he is a Zope, ZODB and Nuxeo contributor and a Zope and OpenStack foundations member, and his talks include ApacheCon, Cassandra Summit, OpenStack Summit, The WWW Conference and EuroPython.
Building a Multi-Region Cluster at Target (Aaron Ploetz, Target) | Cassandra ... - DataStax
Lessons learned from a year spent building a Cassandra cluster over multiple regions, data centers, and providers. Will discuss our successes and learnings on replication, operations, and application development.
About the Speaker
Aaron Ploetz Lead Technical Architect, Target
Aaron is a Lead Technical Architect for Target, where he coaches development teams on modeling and building applications for Cassandra. He is active in the Cassandra tags on StackOverflow, and has also contributed patches to cqlsh. Aaron holds a B.S. in Management/Computer Systems from the University of Wisconsin-Whitewater, a M.S. in Software Engineering and Database Technologies from Regis University, and is a 2x DataStax MVP for Apache Cassandra.
Co-Founder and CTO of Instaclustr, Ben Bromhead's presentation at the Cassandra Summit 2016, in San Jose.
This presentation will show how to create truly elastic Cassandra deployments on AWS, allowing you to scale and shrink your large Cassandra deployments multiple times a day. Leveraging a combination of EBS backed disks, JBOD, token pinning and our previous work on bootstrapping from backups, you will be able to dramatically reduce costs per cluster by scaling to match your daily workloads.
Nationale EuroCloud Monitor 2015 "Tussen Trotski en Troelstra" - Peter Vermeulen
Three years after the first Nationale EuroCloud Monitor, we look back and look ahead with the help of new research. Where SaaS is spreading in "revolutionary" fashion, IaaS is developing considerably more slowly in the business world. And has using the cloud actually delivered anything or not?
Report on OAS Round Table on Indigenous Trade and Development: Case Study of... - Wayne Dunn
The Organization of American States (OAS) Round Table followed up on the UNDP Round Table and was organized to continue to educate and inform the international community on the potential of indigenous partnerships and trade, and to showcase the recently formed partnership between the Meadow Lake Tribal Council of Canada and the Miskito Indian development organization CIDESA, of Nicaragua. The session, which was held at OAS Headquarters in Washington DC and was organized and chaired by Wayne Dunn, brought together a broad range of indigenous development practitioners, policy makers, international experts and indigenous peoples from throughout the Americas. The discussion focused on the potential for Canadian indigenous development expertise to provide technical assistance and support to indigenous peoples elsewhere in Latin America, with particular focus on the Meadow Lake/Miskito partnership.
Convince your CEO to go digital. The world has shifted, but the C-suite needs a way to sell it to the board and to middle management. This is your job. How does social media relate to delivering strategic goals?
This presentation will show how to create truly elastic Cassandra deployments on AWS, allowing you to scale and shrink your large Cassandra deployments multiple times a day.
Leveraging a combination of EBS backed disks, JBOD, token pinning and our previous work on bootstrapping from backups, you will be able to dramatically reduce costs per cluster by scaling to match your daily workloads.
Warning: This presentation will probably contain some references to late 2000's pop group LMFAO
About the Speaker
Ben Bromhead CTO, Instaclustr
Ben Bromhead is the CTO of Instaclustr where he is responsible for working closely with his engineering team and customers to build highly available, scalable applications on top of Cassandra. Instaclustr is the only multi-cloud, self service Cassandra as a Service provider in the world and is dedicated to providing world-class support.
AWS Webcast - Managing Big Data in the AWS Cloud_20140924 - Amazon Web Services
This presentation deck will cover specific services such as Amazon S3, Kinesis, Redshift, Elastic MapReduce, and DynamoDB, including their features and performance characteristics. It will also cover architectural designs for the optimal use of these services based on dimensions of your data source (structured or unstructured data, volume, item size and transfer rates) and application considerations - for latency, cost and durability. It will also share customer success stories and resources to help you get started.
Some of the most common questions we hear from users relate to capacity planning and hardware choices. How many replicas do I need? Should I consider sharding right away? How much RAM will I need for my working set? SSD or HDD? No one likes spending a lot of cash on hardware, and cloud bills can be just as painful. MongoDB is different from traditional RDBMSs in its resource management, so you need to be mindful when deciding on the cluster layout and hardware. In this talk we will review the factors that drive the capacity requirements: volume of queries, access patterns, indexing and working set size, among others. Attendees will gain additional insight as we go through a few real-world scenarios, as experienced with MongoDB Inc customers, and come up with their ideal cluster layout and hardware.
Production NoSQL in an Hour: Introduction to Amazon DynamoDB (DAT101) | AWS r... - Amazon Web Services
Amazon DynamoDB is a fully-managed, zero-admin, high-speed NoSQL database service. Amazon DynamoDB was built to support applications at any scale. With the click of a button, you can scale your database capacity from a few hundred I/Os per second to hundreds of thousands of I/Os per second or more. You can dynamically scale your database to keep up with your application's requirements while minimizing costs during low-traffic periods. The service has no limit on storage. You also learn about Amazon DynamoDB's design principles and history.
Datadog: a Real-Time Metrics Database for One Quadrillion Points/Day - C4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/2mAKgJi.
Ian Nowland and Joel Barciauskas talk about the challenges Datadog faces as the company has grown its real-time metrics systems that collect, process, and visualize data to the point they now handle trillions of points per day. They also talk about how the architecture has evolved, and what they are looking to in the future as they architect for a quadrillion points per day. Filmed at qconnewyork.com.
Ian Nowland is the VP Engineering Metrics and Alerting at Datadog. Joel Barciauskas currently leads Datadog's distribution metrics team, providing accurate, low latency percentile measures for customers across their infrastructure.
Work with hundred of hot terabytes in JVMsMalin Weiss
Third-party updates to the database can cause Hazelcast applications to work with data which is out-of-date.
By synchronizing with an underlying database using an SQL Reflector, the Hazelcast Maps will be “alive” and change whenever the underlying data changes. The solution can also automatically derive domain models directly from the database schemas, so that you can start using the solution very quickly and handle extreme volumes of data.
C* Summit 2013: Large Scale Data Ingestion, Processing and Analysis: Then, No...DataStax Academy
The presentation aims to highlight the challenges posed by large scale and near real-time data processing problems. In past, such problems were solved using conventional technologies, primarily a database and JMS queue. However these solutions had their limits and presented serious problems in terms of scale and redundancy. The new breed of products - a la Cassandra & Kafka, being innately distributed in their design, aim to tackle such challenges in a very elegant manner. The presentation will showcase some of the use cases of this genre from the industry and describe the solutions which have been increasing in their sophistication.
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. You¹ll also hear from Dan Wagner, CEO at Civis Analytics, as he discusses why the Civis data science platform was designed on top of Amazon Redshift and the AWS platform in order to help smart organizations bridge their data silos, build 360 degree view of their customer relationships, and identify opportunities for driving their companies forward by leveraging enormous datasets, the power of analytics, and economies of scale on the AWS platform.
Dyn delivers exceptional Internet Performance. Enabling high quality services requires data centers around the globe. In order to manage services, customers need timely insight collected from all over the world. Dyn uses DataStax Enterprise (DSE) to deploy complex clusters across multiple datacenters to enable sub 50 ms query responses for hundreds of billions of data points. From granular DNS traffic data, to aggregated counts for a variety of report dimensions, DSE at Dyn has been up since 2013 and has shined through upgrades, data center migrations, DDoS attacks and hardware failures. In this webinar, Principal Engineers Tim Chadwick and Rick Bross cover the requirements which led them to choose DSE as their go-to Big Data solution, the path which led to SPARK, and the lessons that we’ve learned in the process.
Pollfish is a survey platform which provides access to millions of targeted users. Pollfish allows easy distribution and targeting of surveys through existing mobile apps. (https://www.pollfish.com/). At pollfish we use Cassandra for difference use cases, eg. for application data store to maximize write throughput when appropriate and for our analytics project to find insights in application generated data. As a medium to accomplish our success so far, we use the Datastax's DSE 4.6 environment which integrates Appache Cassadra, Spark and a hadoop compatible file system (CFS). We will discuss how we started, how the journey was and the impressions gained so far along with some tips learned the hard way. This is a result of joint work of an excellent team here at Pollfish.
Scaling with sync_replication using Galera and EC2Marco Tusa
Challenging architecture design, and proof of concept on a real case of study using Syncrhomous solution.
Customer asks me to investigate and design MySQL architecture to support his application serving shops around the globe.
Scale out and scale in base to sales seasons.
In addition to running databases in Amazon EC2, AWS customers can choose among a variety of managed database services. These services save effort, save time, and unlock new capabilities and economies. In this session, we make it easy to understand how they differ, what they have in common, and how to choose one or more. We explain the fundamentals of Amazon DynamoDB, a fully managed NoSQL database service; Amazon RDS, a relational database service in the cloud; Amazon ElastiCache, a fast, in-memory caching service in the cloud; and Amazon Redshift, a fully managed, petabyte-scale data-warehouse solution that can be surprisingly economical. We will cover how each service might help support your application, how much each service costs, and how to get started. We will also have with us Jeongsang Baek, the VP of Engineering from IGAWorks, Korea’s No.1 mobile business platform, who will walk us through their architecture and share with us the key insights that they gained from using the various AWS database technologies to deliver a reliable, efficient and cost-effective experience.
Let's dive deeper into the world of ODC! Ricardo Alves (OutSystems) will join us to tell all about the new Data Fabric. After that, Sezen de Bruijn (OutSystems) will get into the details on how to best design a sturdy architecture within ODC.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
JMeter webinar - integration with InfluxDB and GrafanaRTTS
Watch this recorded webinar about real-time monitoring of application performance. See how to integrate Apache JMeter, the open-source leader in performance testing, with InfluxDB, the open-source time-series database, and Grafana, the open-source analytics and visualization application.
In this webinar, we will review the benefits of leveraging InfluxDB and Grafana when executing load tests and demonstrate how these tools are used to visualize performance metrics.
Length: 30 minutes
Session Overview
-------------------------------------------
During this webinar, we will cover the following topics while demonstrating the integrations of JMeter, InfluxDB and Grafana:
- What out-of-the-box solutions are available for real-time monitoring JMeter tests?
- What are the benefits of integrating InfluxDB and Grafana into the load testing stack?
- Which features are provided by Grafana?
- Demonstration of InfluxDB and Grafana using a practice web application
To view the webinar recording, go to:
https://www.rttsweb.com/jmeter-integration-webinar
Search and Society: Reimagining Information Access for Radical FuturesBhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies needs to be explicitly articulated, and we need to develop theories of change in context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
4. A High Level look at RTB
1. Browsers visit Publishers and create impressions.
2. Publishers sell impressions via Exchanges.
3. Exchanges serve as auction houses for the impressions
4. On behalf of the marketer, m6d bids on the impressions via the auction house. If m6d wins, we display our ad to the browser.
5. Performance and Data
• Billions and billions of bid requests a day
• A single request can result in multiple Cassandra operations!
• One cluster is just under 10TB and growing
• Low latency requirement: typically below 120 ms
• Limited data available to m6d via the exchange
6. Segment Data
Segments are how we assign product or service affinity to a group of users. Users we consider to be like-minded with respect to a given brand will be placed in the same segment.
Segment data is just one component of our overarching data model.
Segments help to reduce the number of calculations we do in real time.
7. Old Approach for Segment Data
[Diagram labels: Application Nodes (Tomcat + MySQL); MySQL Data Push; Event Logs; Aggregation; Hadoop]
Limitations
• Periodically updated.
• Only a subsection of the data.
• Cluster performance is affected during a data push.
8. Cassandra Approach for Segment Data
[Diagram labels: Application Nodes (Tomcat + Less MySQL Usage); Cassandra]
Better!
• Updating in real time now possible
• Distributed, not duplicated
• Less complexity to manage
• Storing more information
• We can now bid on users sooner!
9.
10. During waking hours: Dr. Realtime
• User traffic is at peak
• Applications need low latency operations
• High volume of read and write operations
• Desire high cache hit rate to limit disk IO
• Dr. Realtime conducts 'experiments' on
optimization
11. Experiment: Active Set, VFS, cache
size tuning
• Cluster optimization is a topic that must be
revisited periodically
• User base and requests are perpetually growing
• Amount of physical data stored grows
• New features typically result in new data and
more requests
• How to tune your environment is application
and hardware dependent
12. Physical data directory
• SSTable holds data
• Index holds offsets to avoid disk seeks
• Bloom filter: probabilistic lookup system
– (also a stat table)
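The "probabilistic lookup" idea behind a Bloom filter can be sketched in a few lines. This is a minimal illustration, not Cassandra's implementation: the class name, the bit-array size, the number of hashes, and the salted SHA-256 scheme are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
for row_key in ("user:1", "user:2", "user:3"):
    bloom.add(row_key)

print(bloom.might_contain("user:2"))    # key was added
print(bloom.might_contain("user:999"))  # key was never added; a false positive is possible but unlikely
```

A negative answer lets a read skip an sstable entirely, which is why the filter helps avoid disk seeks.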
13. When RAM > Data Size
• If you can afford to keep your data set in RAM:
• It is fast from VFS cache
• That's it. You're optimized.
• However, you do not usually need this much RAM
14. When RAM < Data Size
• The OS will cache the most active portions of disk
• The write/compact model causes the cache to churn
• User requests cause the cache to churn
15. Understanding Active set with a hypothetical example
Webmail service (Coldmail):
• I have had an account for 10 years, and I never log in more than twice a month
• I have 1,000,000 items in my inbox
• Not in the active set
Social networking (chirper):
• I am logged in every day
• I commonly read and get updates from my friends
• In the active set
16. $60,000 Question
How do you determine what the
active set of your application and
user base is?
18. Turn on a cache
• JMX lets you tune just a single node, for side-by-side comparisons
• Set the key cache size very large (be more careful with the row cache)
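One way to reason about the active set offline is to replay a request trace through a simulated cache at several sizes. The sketch below is a hypothetical stand-in for the experiment: the LRU policy, the synthetic trace (90% of requests to 1,000 "active" users, the rest to a long tail), and the function name `lru_hit_rate` are assumptions for illustration, not measurements from the cluster.

```python
import random
from collections import OrderedDict


def lru_hit_rate(requests, cache_size):
    """Replay a request trace through an LRU cache and return the hit rate."""
    cache = OrderedDict()
    hits = 0
    for key in requests:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(requests)


# Hypothetical trace: 90% of requests go to 1,000 "active" users,
# the rest are spread over a long tail of 100,000 users.
rng = random.Random(42)
trace = [
    rng.randrange(1_000) if rng.random() < 0.9 else rng.randrange(1_000, 101_000)
    for _ in range(50_000)
]

for size in (250, 1_000, 4_000, 16_000):
    print(f"cache_size={size:>6}  hit_rate={lru_hit_rate(trace, size):.3f}")
```

The point at which a larger cache stops improving the hit rate much is a rough estimate of the active set size.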
19. Analysis
• 8:30: hit rate 91% at 1.2 million entries
• 10:30: hit rate ~93% at 1.7 million entries
• Past a 1.2 million entry cache, memory might be better spent elsewhere
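The diminishing return in the two data points above can be made explicit with a little arithmetic. This sketch assumes the "mil" figures are cache entries and treats the ~93% reading as exactly 93%; the two readings were also taken at different times of day, so traffic changes are mixed into the comparison.

```python
# Observations from the slide: (time, cache entries, hit rate).
observations = [
    ("8:30", 1_200_000, 0.91),
    ("10:30", 1_700_000, 0.93),  # the slide gives this one as approximate (~93%)
]

(_, entries_a, rate_a), (_, entries_b, rate_b) = observations

# Average yield of the first 1.2 million entries, in hit-rate points per million entries.
initial_yield = (rate_a * 100) / (entries_a / 1_000_000)

# Marginal yield of the extra 0.5 million entries.
marginal_yield = ((rate_b - rate_a) * 100) / ((entries_b - entries_a) / 1_000_000)

print(f"first 1.2M entries: {initial_yield:.1f} points per million entries")
print(f"next 0.5M entries:  {marginal_yield:.1f} points per million entries")
```

Under these assumptions the extra half million entries buy only a few points of hit rate per million entries, compared with roughly 76 for the first 1.2 million.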
20. Active set conclusions
• Determine the sweet spot for hit rate and cache size
• Do not try to cache the long tail of requests
• When all other things are equal, dedicate more cache to the most-read column family
• Use row cache only if rows are a predictable size
• Large row caches cannot be saved, so they are cold on restart
21. read_repair_chance – Cassandra's
version of an ethical dilemma
• Read Repair generates additional reads across the cluster for each user read
• Read Repair Chance controls the probability of Read Repair occurring
• If data is write-once or write-rarely, Read Repair may be unnecessary
– data whose read ratio is much larger than its write ratio
– data that does not need strict consistency
• As of 1.0, hinted handoff no longer needs to wait on the failure detector, and the Read Repair Chance default has been lowered from 100% to 10%
– CASSANDRA-2045, thanks to ntelford and co!
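To get a feel for what the setting costs, the expected number of replica reads per user read can be estimated with a simplified model. The model is an assumption for illustration: a read always contacts the replicas its consistency level requires, and with probability read_repair_chance it contacts all replicas. The function and parameter names are hypothetical.

```python
def expected_replica_reads(replication_factor, replicas_for_consistency, read_repair_chance):
    """Expected replicas contacted per user read under a simplified model.

    Assumption: a read always contacts the replicas needed for its consistency
    level, and with probability read_repair_chance it contacts all of them.
    """
    extra = replication_factor - replicas_for_consistency
    return replicas_for_consistency + read_repair_chance * extra


for chance in (1.0, 0.1, 0.0):
    reads = expected_replica_reads(replication_factor=3,
                                   replicas_for_consistency=1,
                                   read_repair_chance=chance)
    print(f"read_repair_chance={chance:.1f} -> {reads:.1f} replica reads per user read")
```

With three replicas and single-replica reads, this model puts the cost of a 100% chance at three replica reads per user read, against 1.2 at 10%.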
22. Analysis for RRC 'test subjects'
Candidate: Many reads, few writes
Inside story: This data used to take 2 days. A few ms... Come on man!
Candidate ?: Many writes
Inside story: This is used for frequency capping, higher % justified
23. Experiment: Test the limits of NoSQL
science with YCSB
YCSB is a distributed load generator
that comes in handy!
• Before our upgrade from 0.6.X->0.7.X
– All the benchmarks were better
– But good to kick the tires
• Prototyping new Column Family
– Time to write 500 million records
– How many reads/second on 50GB of data
25. Round 1 Results
RunTime: 410 Seconds
Throughput: 2437 Operations/Second
Shared the results on #cassandra IRC.
Suggestion! Try: -threads 30
26. Trying it again…
Original Results:
-threads 10
RunTime: 410 Seconds
Throughput: 2437 Operations/Second
New Results:
-threads 30
RunTime: 196 Seconds
Throughput: 5088 Operations/Second
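The two runs can be compared with simple arithmetic. The sketch below uses only the figures from the slides; the dictionary layout and variable names are assumptions for the example.

```python
runs = {
    10: {"runtime_s": 410, "throughput_ops_s": 2437},
    30: {"runtime_s": 196, "throughput_ops_s": 5088},
}

for threads, run in runs.items():
    # Approximate operation count implied by runtime x throughput.
    run["approx_ops"] = run["runtime_s"] * run["throughput_ops_s"]
    print(f"-threads {threads}: ~{run['approx_ops']:,} operations")

speedup = runs[30]["throughput_ops_s"] / runs[10]["throughput_ops_s"]
print(f"throughput speedup from 10 to 30 threads: {speedup:.2f}x")
```

Both runs work out to roughly one million operations, so the same workload finished about twice as fast with three times the client threads.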
27. Cassandra writes fast! (duh)
• Read path
– Row, Key, and VFS caches
– With enough data and read ops disks bottleneck
• Write path
– structured log writes are linear (sequential) on disk and fast
– compaction merges sstables in background
• Many threads maximize write capability
• Many threads also stop a read blocking on IO from limiting write potential
28. Night falls and Dr. Realtime
transforms...
/etc/cron.d/mr_batch_dr_realtime
# turn into Mr. Batch at night
0 0 * * * root nodetool -h `hostname` setcompactionthroughput 999
# turn back into Dr. Realtime for day
0 6 * * * root nodetool -h `hostname` setcompactionthroughput 16
Setting throughput ensures
• During the day most IOPS are free to serve traffic
• At night the cluster can rip through compactions
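The day/night switch in the cron entries can also be expressed as a small function, which is handy for checking which setting should be in effect at a given hour. This is a sketch only: the constant and function names are hypothetical, and the command is built as a list but deliberately not executed.

```python
import socket
from datetime import datetime

NIGHT_MB_PER_S = 999  # "Mr. Batch": let compaction run nearly unthrottled
DAY_MB_PER_S = 16     # "Dr. Realtime": keep disk IO free for traffic


def compaction_throughput_for_hour(hour):
    """Return the throughput the cron schedule (00:00 and 06:00) would have in effect."""
    return NIGHT_MB_PER_S if 0 <= hour < 6 else DAY_MB_PER_S


def nodetool_command(host, hour):
    """Build (but do not run) the nodetool invocation for the given hour."""
    return ["nodetool", "-h", host, "setcompactionthroughput",
            str(compaction_throughput_for_hour(hour))]


print(nodetool_command(socket.gethostname(), datetime.now().hour))
```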
29. Mr Batch ravages data creating
tombstones
• If a user clears their cookies, they vanish forever
• In actuality they return as a new user
• Data has very high turnover
• We need to enforce retention policy on data
• TTL columns do not meet our requirements :(
• Cleanup daemon is a throttled range scanner
• Cleanup daemon also produces histograms
every cycle
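A throttled range scan that enforces retention and builds a histogram in the same pass might look roughly like the sketch below. It is not the daemon described in the talk: the in-memory dict stands in for a Cassandra range scan, and the function name, batch size, pause, and 30-day retention are assumptions for illustration.

```python
import time
from collections import Counter


def cleanup_pass(rows, now, retention_days, rows_per_batch=100, pause_s=0.0):
    """Scan rows in key order, drop expired ones, and histogram ages in days.

    rows maps row key -> last-seen timestamp in seconds (a hypothetical
    stand-in for a Cassandra range scan). Pausing between batches is the throttle.
    """
    age_histogram = Counter()
    expired = []
    for i, (key, last_seen) in enumerate(sorted(rows.items()), start=1):
        age_days = int((now - last_seen) // 86_400)
        age_histogram[age_days] += 1
        if age_days > retention_days:
            expired.append(key)
        if i % rows_per_batch == 0:
            time.sleep(pause_s)  # throttle so the scan does not starve live traffic
    for key in expired:
        del rows[key]            # in Cassandra this delete would write a tombstone
    return age_histogram, expired


DAY = 86_400
now = 100 * DAY
rows = {"u1": now - 1 * DAY, "u2": now - 45 * DAY, "u3": now - 90 * DAY}
histogram, removed = cleanup_pass(rows, now, retention_days=30)
print(dict(histogram), removed, sorted(rows))
```

Collecting the histogram during the same scan means the data is read only once per cycle.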
31. A note about different workloads
• Structured log format of C* has deep implications
• Many factors affect performance and disk size:
• Write once data
• Wide rows (many columns)
• Wide rows over time (fragmented)
• Application read write profile
• Deletion/update percentage
• LevelDB-inspired compaction in 1.0 has a different profile than the current tiered compaction
32. Tombstones have costs
• Physically live on disk
• Bloat data, index, and bloom filters
• Tombstones live for a grace period and then are eligible to be removed
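The grace-period rule can be written as a one-line check. This is a simplified sketch: the function name is hypothetical, timestamps are plain seconds, and 864,000 seconds (10 days) is Cassandra's default gc_grace_seconds. Eligibility only means a later compaction may drop the tombstone, not that it disappears immediately.

```python
GC_GRACE_SECONDS = 864_000  # 10 days, Cassandra's default gc_grace_seconds


def tombstone_is_purgeable(deleted_at, now, gc_grace_seconds=GC_GRACE_SECONDS):
    """A tombstone becomes eligible for removal once its grace period has passed."""
    return now - deleted_at > gc_grace_seconds


DAY = 86_400
print(tombstone_is_purgeable(deleted_at=0, now=5 * DAY))   # still inside the grace period
print(tombstone_is_purgeable(deleted_at=0, now=11 * DAY))  # grace period has elapsed
```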
33. Caching after (major) compaction
• In our case (lots of churn), major compaction shrinks data significantly
• Rows fragmented over many sstables are joined
• Tombstones and related data columns are removed
• All files should be smaller
• Smaller files mean better VFS caching