Presenter: Jean Armel Luce, Cassandra Administrator at Orange
At the Cassandra Summit Europe 2013, Jean Armel presented "The Cassandra Experience at Orange - Season 1", explaining the first steps of Cassandra at Orange (the choice of Cassandra, migration without any interruption of service, and improvements in QoS after the migration). For "The Cassandra Experience at Orange - Season 2", Jean Armel focuses on two new features added to the PnS application in recent months: graphs and analytics. A Cassandra table must have one and only one primary key, while some data have many logical identifiers. Designing data as a graph may help! As for analytics, Hadoop + Hive make it possible to run analytics on data stored in Cassandra. This presentation highlights a few tips about installing Hadoop/Hive over C*, and about isolating MapReduce tasks from online queries.
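As a rough illustration of the "many logical identifiers" problem, the sketch below models each identifier as a vertex and links it to a shared record vertex, so that any identifier leads to the same data. This is not the PnS design, which the abstract does not describe; the class name `IdentityGraph`, the identifier kinds and the sample values are all hypothetical, and the graph lives in memory rather than in Cassandra.

```python
from collections import defaultdict


class IdentityGraph:
    """In-memory sketch: identifiers and records are vertices joined by edges."""

    def __init__(self):
        # Adjacency list: vertex -> set of neighbouring vertices.
        self._edges = defaultdict(set)

    def link(self, a, b):
        """Add an undirected edge between two vertices."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def connected(self, start):
        """Return every vertex reachable from `start` (depth-first walk)."""
        seen = {start}
        stack = [start]
        while stack:
            vertex = stack.pop()
            for neighbour in self._edges.get(vertex, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return seen


# Hypothetical data: two identifiers that point at the same user record.
graph = IdentityGraph()
graph.link(("email", "a@example.com"), ("user", "u1"))
graph.link(("msisdn", "0600000000"), ("user", "u1"))

# Starting from the e-mail address we reach the user and the phone number.
reachable = graph.connected(("email", "a@example.com"))
```

In a Cassandra-backed version, one common approach is to store each edge as a row keyed by its source vertex, so that following an edge is a single-partition read; whether that matches the approach in the talk is an assumption.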
Cassandra Summit 2014: A Train of Thoughts About Growing and Scalability — Bu... - DataStax Academy
Presenter: Eiti Kimura, Senior Software Engineer at Movile
Apache Cassandra was adopted by Movile in 2009 and became a fundamental piece of the robust and scalable architecture that supports more than 50 products, reaching over 200 million users in Latin America. In this case study we present the architecture of our ring, configuration details, detailed tuning, the hardware used to achieve our performance requirements (on the order of a few milliseconds), information storage strategies for network and disk space optimization, and best practices, in addition to showing how the architecture evolved from simple systems into scalable, distributed platforms. We introduced our cluster with a relatively low number of nodes (6) using commodity hardware to support critical high-performance applications. After this talk, you'll understand how Apache Cassandra was essential to evolving our systems and supporting the growth of our business. Movile is the leading mobile content company in Latin America. Movile’s products include mobile content, mobile TV, mobile learning, mobile games, mobile payment, mobile marketing and mobile commerce. Every month, it publishes content and services to more than 20 million mobile customers. It has grown substantially over the last few years (with a more than 25-fold increase in its revenue over the last five years) both organically and through an aggressive M&A strategy, including five acquisitions in the last five years. Movile is positioning itself as a kind of Silicon Valley company based in Brazil. For the last two years, Movile has been named in the “Great Place to Work” list for technology companies in Brazil. The company's shareholders include its founders plus Naspers, a South African media conglomerate.
Cassandra Summit 2014: Cassandra in Large Scale Enterprise Grade xPatterns De... - DataStax Academy
Presenter: Claudiu Barbura, Senior Director of Engineering at Atigeo
xPatterns is a big data analytics platform-as-a-service that enables rapid development of enterprise-grade analytical applications. It provides tools, API sets and a management console for building an ELT pipeline with data monitoring and quality gates; a data warehouse for ad-hoc and scheduled querying, analysis, model building and experimentation; tools for exporting data to Cassandra and SolrCloud clusters for real-time access through low-latency/high-throughput (automatically generated) APIs; and dashboard and visualization API/tools leveraging the available data and models. In this talk I'll share some of the hard lessons we've learned in the past three years while leveraging Cassandra (and Hector) in large-scale enterprise-grade deployments. We will focus on three specific areas in which we identified consistent best practices and design patterns: data model optimization as a result of exporting data from HDFS/Hive/Shark into Cassandra through Spark/Hadoop MR jobs under Mesos with throttling, instrumentation and resilience features; automatically publishing geo-replicated, instrumented and monitored REST APIs on top of the exported Cassandra data; and lessons learned from running Cassandra at scale from 0.6 to 2.0.6, including performance tuning and tips and tricks. You will see live demos of our Publish to NoSQL tools (Spark/Shark, Mesos, Hive, Cassandra), a dashboard application built on top of generated data APIs (D3.js, Cassandra) and xPatterns' monitoring and instrumentation consoles (Graphite, Ganglia, Nagios).
Cassandra Summit 2014: Social Media Security Company Nexgate Relies on Cassan... - DataStax Academy
Presenter: Harold Nguyen, Senior Data Scientist at Nexgate
In this talk, we focus on a use case showing how Cassandra can be used to detect spam and spammers on social media. We also show how we use Cassandra to train our 100+ social-media-security classifiers. The accuracy of any security product is directly tied to the breadth of the corpus of data upon which it is built. For Nexgate, this means that the success of our products is inextricably tied to our ability to save everything we've ever scanned, but in a way that is still readily accessible. In the days before NoSQL, this was hard. This talk is about how DataStax and Cassandra make it easy.
Cassandra Summit 2014: META — An Efficient Distributed Data Hub with Batch an... - DataStax Academy
Presenter: Alvaro Agea, Big Data Architect at Stratio
Big Data analysis is commonly associated with batch processing of data stored in distributed file systems. The advent of streaming data is exposing the shortcomings of traditional data analysis. Users aiming to combine both worlds, batch processing and streaming, have had to turn to unreliable in-house developments. We propose Stratio META to meet this new need. META is a technology based on a structured NoSQL datastore with advanced indexing capabilities. META includes an efficient query planner designed from scratch. The planner determines the optimal path to execute a query and which components should be involved.
A good data model is key to getting the best performance from Apache Cassandra. The Log Structured Storage Engine and its distributed architecture mean we cannot rely on a paradigm such as Normal Form to evaluate a model. Instead we need to design data models that support the read path of the application. In this talk Aaron Morton will walk through the key principles and patterns of a good CQL3 data model using simple examples.
Cassandra Summit 2014: Apache Cassandra at Telefonica CBS - DataStax Academy
Presenter: Antonio Alcocer, Big Data Architect at Stratio
Telefonica is the incumbent telecommunications network operator in Spain and the fourth in the world by capitalisation. Cyber security is one of our most successful businesses worldwide. We monitor our clients and protect them from attacks. We analyze millions of records from multiple sources, including social media, DNS records, and the underground internet, to generate alerts and security reports for our clients. This use case required a Big Data component capable of processing the data and extracting information from it in real time; warnings and alerts are time-sensitive if security attacks are to be dealt with efficiently. Our original architecture was the typical one used for data fusion systems. It included several collectors, a processing layer based on legacy systems, and a data store. The initial setup included a MongoDB database and an ad-hoc application. This solution, however, proved to be unfit for the specific purpose of dispatching alerts. We proposed to use Cassandra and Spark instead. This approach fulfilled our original specifications as intended. Our talk will explain why we migrated the architecture and how the adopted solution based on Spark and Cassandra solved our problem.
Like many startups, Coursera began its data storage journey with MySQL, a familiar and industry-proven database. As Coursera's user base grew from several thousand to many millions, we found that MySQL provided limited availability and restricted our ability to scale easily. New product initiatives and requirements provided a perfect opportunity to revisit our choice of core workhorse database.
After evaluating several NoSQL databases, including MongoDB, DynamoDB and HBase, we elected to transition to Cassandra. Cassandra's relative maturity, masterless architecture (for availability), tunable consistency, and stable low-latency performance made it a clear winner for our needs.
Learn more about what it takes to transition from SQL to Cassandra in this talk.
You've researched. You've discussed. You've had (multiple) meetings. You've installed. You've tested (hopefully). You've decided. Now what (besides having attended a Cassandra Day)? What else are you going to need to put that Cassandra cluster into beta? Our evangelist team will give you the Cliff Notes to make that next step go as smooth as.... well... as smooth as it can be!
Presenter: Chris Lohfink, Engineer at Pythian
This session will cover a walk-through to provide an understanding of key metrics critical to operating a Cassandra cluster effectively. Without context to the metrics, we just have pretty graphs. With context, we have a powerful tool to determine problems before they happen and to debug production issues more quickly.
The Last Pickle: Distributed Tracing from Application to Database - DataStax Academy
Monitoring provides information on system performance; however, tracing is necessary to understand the performance of individual requests. Detailed query tracing has been provided by Cassandra since version 1.2 and is invaluable when diagnosing problems. However, knowing which queries to trace and why the application makes them still requires deep technical knowledge. By merging application tracing via Zipkin with Cassandra query tracing, we automate the process and make it easier to identify and resolve problems. In this talk Mick Semb Wever, Team Member at The Last Pickle, will introduce Cassandra query tracing and Zipkin. He will then propose an extension that allows clients to pass a trace identifier through to Cassandra, and a way to integrate Zipkin tracing into Cassandra. Driving all this is the desire to create one tracing view across the entire system.
.NET developers have a lot of options when it comes to databases these days. Apache Cassandra is a scalable, fault-tolerant database that has already found its way into more than 25% of the Fortune 100 and continues to grow in popularity. But what makes it different from the myriad of other options available? In this talk, we’ll take a deep dive into Cassandra and learn about:
- Cassandra’s internals and how it works
- CQL (the SQL-like query language for Cassandra)
- Data Modeling like a pro
- Tools available for developers
- Writing .NET code that talks to Cassandra
If there’s time and interest, we’ll finish up with how some companies are already using Cassandra to power services you probably interact with in your daily life. You’ll leave with all the tools you need to start building highly available .NET applications and services on top of Cassandra.
Third normal form? That’s so 20th century. Learn the newest techniques to make your Cassandra database sing from the rafters in performance and scalability. AND it uses concepts that you already know and apply every day. You can do this. This is the must-see half hour of your professional life! These developers found a new way to work with databases. First you will be shocked, then you will be inspired!
Lessons Learned with Cassandra and Spark at the US Patent and Trademark Office - DataStax Academy
This case study concerns moving large amounts of patent data from Cassandra to Solr. It describes how we approached the problem, how Spark was introduced as a solution, and how to optimize Spark jobs. I will cover:
* Understanding the parts of a Spark Job. Which components run where and common issues.
* Adding metrics to show where pain points are in your code.
* Comparing various methods in the API to achieve more performant code.
* How we saved time and made a repeatable process with Spark.
Many architects in companies ranging from small startups to publicly traded companies are turning to event-driven architectures to solve mission-critical scalability problems, often ones that carry real-time processing requirements. In this talk we'll demonstrate how you can use Apache Cassandra to build powerful event-driven systems in combination with technologies like Akka, RabbitMQ, and others. These concepts will help you radically simplify the design of complex systems and give you the ability to remain available and responsive even in the face of bursty workloads. If your organization does any sort of stream processing or real-time aggregation work with C*, then this talk is for you.
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and Spark - DataStax Academy
Presenter: Evan Chan, Principal Software Engineer at Socrata Inc.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
At Signal we've been running Apache Cassandra in production since late 2011. We use a multi-region Cassandra deployment to make our data available globally to our customers. While Cassandra does much of the heavy lifting for us, we've run into interesting challenges during periods of rapid growth. In this presentation we'll focus on one of those scenarios, including our before-and-after data model, the methodology and tools we used to recover, and lessons learned along the way.
Cassandra @ Sony: The good, the bad, and the ugly part 1 - DataStax Academy
This talk covers scaling Cassandra to a fast-growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the PlayStation community.
Assume you have a Cassandra cluster with hundreds of tables, and one day the latency of client requests and the CPU utilization of the Cassandra process become unacceptable. Our team regularly faced this problem. A simple look at the metrics did not help us: because of high CPU utilization, we saw bad metrics across almost all of our tables. In this talk I will discuss ways to find out which table (or tables) is the most problematic for the cluster and creates problems for other tables. The talk will be fully practical, as I will present the real steps of our investigation. Some of the steps were successful, others were not. In the end, we reduced both latency and CPU utilization by about a factor of 10 without adding nodes or hardware resources.
Timeli: Believing Cassandra: Our Big-Data Journey To Enlightenment under the ... - DataStax Academy
Thursday, September 24
1:50 PM - 2:30 PM
Ballroom G
Believing Cassandra: Our Big-Data Journey To Enlightenment under the Cassandra Paradigm
It turns out that much can be learned about Cassandra in a year's time, given a high enough pain tolerance on the part of an organization's founders. Join us as we at Timeli.io step you through exactly what happened when we walked into an in-production time series implementation that somehow could not return data in a time series format to its existing customers. We will then discuss how we reworked that same implementation to be fully functional, and how we started on the road to finding the keys to Cassandra's legendary performance capabilities in a Zookeeper/AMQP/SQL/Tomcat stack. The path for this journey left no block unstumbled, so if your organization still has toes that are left unbruised, this talk could well save you pain.
One is the loneliest number
Much, much worse than two
Many of PagerDuty’s mission-critical services are based on Cassandra, and as a result we have built up a lot of operational experience over the past few years. Unfortunately, some of our best learnings have come from sizeable failures in production. One of those failures stemmed from having multiple services share the same Cassandra cluster, which was a major factor in PagerDuty’s largest outage of 2014. This talk will relive that outage, sort through the wreckage, and explain why isolating your Cassandra clusters is a best practice you should adopt.
Cassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series Example - DataStax Academy
Take a deep dive into best practices for Cassandra data modeling, with a review of a time series data modeling example. Partition key selection, data duplication, in-place aggregation, and using TTLs and DateTieredCompaction to positive effect will all be covered.
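To make the partition key selection point concrete, here is a minimal sketch of one common time series pattern: bucketing readings by source and UTC day so that partitions stay bounded. It is an illustration, not the model from the talk. The table `readings`, its columns, the 30-day TTL and the helper names are all hypothetical, and the CQL is shown as a string for orientation only (nothing here connects to a cluster).

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# Hypothetical schema, for orientation only; every name is illustrative.
SCHEMA = """
CREATE TABLE readings (
    sensor_id text,
    day       text,        -- daily bucket, e.g. '2015-03-04'
    ts        timestamp,
    value     double,
    PRIMARY KEY ((sensor_id, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND default_time_to_live = 2592000   -- 30 days, in seconds
  AND compaction = {'class': 'DateTieredCompactionStrategy'};
"""


def partition_key(sensor_id: str, ts: datetime) -> Tuple[str, str]:
    """Return the (sensor_id, day) partition a reading belongs to.

    `ts` must be timezone-aware; it is converted to UTC before bucketing.
    """
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (sensor_id, day)


def partitions_for_range(
    sensor_id: str, start: datetime, end: datetime
) -> List[Tuple[str, str]]:
    """List every daily partition a query over [start, end] has to read."""
    first = start.astimezone(timezone.utc).date()
    last = end.astimezone(timezone.utc).date()
    span = (last - first).days
    return [
        (sensor_id, (first + timedelta(days=offset)).isoformat())
        for offset in range(span + 1)
    ]


# A query spanning three calendar days touches three partitions.
keys = partitions_for_range(
    "s1",
    datetime(2015, 3, 4, 22, 0, tzinfo=timezone.utc),
    datetime(2015, 3, 6, 1, 0, tzinfo=timezone.utc),
)
```

The bucket size is the main design choice: a daily bucket keeps each partition small for high-frequency sensors, at the cost of one read per day when querying a longer range.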
C* Summit EU 2013: The Cassandra Experience at Orange - DataStax Academy
Speaker: Jean Armel Luce — Senior Software Engineer/Cassandra Admin at Orange
Video: http://www.youtube.com/watch?v=mefOE9K7sLI&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=28
At Orange, Jean Armel has helped develop an open source tool for migrating data to Cassandra; Jean Armel and his team needed the NoSQL solution Apache Cassandra in order to sustain the growth in requests and data volume required by their application, PnS. In this session, Jean Armel will start with an overview of the Orange application PnS, then dive into why they chose Apache Cassandra and how they migrated their data without any interruption of service. Jean Armel will also show how the application behaves after the migration.
How to Store and Visualize CAN Bus Telematic Data with InfluxDB Cloud and Gra... - InfluxData
CSS Electronics develop & manufacture professional-grade, simple-to-use CAN bus data loggers. Their plug-and-play CANedge2 logger records time-stamped raw CAN data to an extractable industrial SD card — and connects via WiFi/3G/4G access points to upload the data to the end user’s own servers. The CANedge2 is ideal for collecting automotive sensor metrics like speed, temperatures, state of charge, GPS and more. Learn how to create your own telematics dashboard built on InfluxDB in minutes by attending this webinar!
Join us as Martin Falch dives into:
- CSS Electronics’ approach to improving R&D field testing, diagnostics, fleet management and predictive maintenance
- The CANedge’s methodology for collecting IIoT data from cars, trucks and machines
- How they process time series data from S3 via Python to store it in InfluxDB and visualize it with Grafana
A good data model is key to getting the best performance from Apache Cassandra. The Log Structured Storage Engine and it's distributed architecture mean we cannot rely on a paradigm such as Normal Form to evaluate a model. Instead we need to design data models that support the read path of the application. In this talk Aaron Morton will walk through the key principles and patterns of a good CQL3 data model using simple examples.
Cassandra Summit 2014: Apache Cassandra at Telefonica CBSDataStax Academy
Presenter: Antonio Alcocer, Big Data Architect at Stratio
Telefonica is the incumbent telecommunications network operator in Spain and the fourth one in capitalisation in the world. Cyber security is one of our most successful businesses worldwide. We provide monitoring and protecting clients from attacks. We analyze millions of data from multiple sources including social media, DNS records, and underground internet, to generate alerts and security reports for our clients. This use case required a Big Data component capable of processing the data and extract its information in real-time; warnings and alerts are time-sensitive in order to deal efficiently with security attacks. Our original architecture was the typical one used for data fusion systems. It included several collectors, a processing layer based on legacy systems, and a data store. The initial setup included a MongoDB database and an ad-hoc application. This solution however proved to be unfit for the specific purpose of dispatching alerts. We proposed to use Cassandra and Spark instead. This approach did manage to fulfill our original specifications as intended. Our talk will explain the reasons why we migrated the architecture and how the adopted solution based on Spark and Cassandra solved our problem.
Like many startups, Coursera began its data storage journey with MySQL, a familiar and industry-proven database. As Coursera's user base grew from several thousand to many millions, we found that MySQL provided limited availability and restricted our ability to scale easily. New product initiatives and requirements provided a perfect opportunity to revisit our choice of core workhorse database.
After evaluating several NoSQL databases, including MongoDB, DynamoDB and HBase, we elected to transition to Cassandra . Cassandra's relative maturity, masterless architecture (for availability), tunable consistency, and stable low-latency performance made it a clear winner for our needs.
Learn more about what it takes to transition from SQL to Cassandra in this talk.
You've researched. You've discussed. You've had (multiple) meetings. You've installed. You've tested (hopefully). You've have decided. Now what (besides having attended a Cassandra Day)? What else are you going to need to put that Cassandra cluster into beta? Our evangelist team will give you the Cliff Notes to make that next step go as smooth as.... well... as smooth as it can be!
Presenter: Chris Lohfink, Engineer at Pythian
This session will cover a walk-through to provide an understanding of key metrics critical to operating a Cassandra cluster effectively. Without context to the metrics, we just have pretty graphs. With context, we have a powerful tool to determine problems before they happen and to debug production issues more quickly.
The Last Pickle: Distributed Tracing from Application to DatabaseDataStax Academy
Monitoring provides information on system performance, however tracing is necessary to understand individual request performance. Detailed query tracing has been provided by Cassandra since version 1.2 and is invaluable when diagnosing problems. Although knowing what queries to trace and why the application makes them still requires deep technical knowledge. By merging Application tracing via Zipkin and Cassandra query tracing we automate the process and make it easier to identify and resolve problems. In this talk Mick Semb Wever, Team Member at The Last Pickle, will introduce Cassandra query tracing and Zipkin. He will then propose an extension that allows clients to pass a trace identifier through to Cassandra, and a way to integrate Zipkin tracing into Cassandra. Driving all this is the desire to create one tracing view across the entire system.
.NET developers have a lot of options when it comes to databases these days. Apache Cassandra is a scalable, fault-tolerant database that has already found its way into more than 25% of the Fortune 100 and continues to grow in popularity. But what makes it different from the myriad of other options available? In this talk, we’ll take a deep dive into Cassandra and learn about:
- Cassandra’s internals and how it works
- CQL (the SQL-like query language for Cassandra)
- Data Modeling like a pro
- Tools available for developers
- Writing .NET code that talks to Cassandra
If there’s time and interest, we’ll finish up with how some companies are already using Cassandra to power services you probably interact with in your daily life. You’ll leave with all the tools you need to start build highly available .NET applications and services on top of Cassandra.
Third normal form? That’s so 20th century. Learn the newest techniques to make your Cassandra database sing from the rafters in performance and scalability. AND it uses concepts that you already know and apply every day. You can do this. This is the must-see half hour of your professional life! These developers found a new way to work with databases. First you will be shocked, then you will be inspired!
Lessons Learned with Cassandra and Spark at the US Patent and Trademark OfficeDataStax Academy
This case study concerns moving large amounts of patent data from Cassandra to Solr. How we approached the problem, the introduction of Spark as a solution, and how to optimize Spark jobs. I will cover:
* Understanding the parts of a Spark Job. Which components run where and common issues.
* Adding metrics to show where pain points are in your code.
* Comparing various methods in the API to achieve more performant code.
* How we saved time and made a repeatable process with Spark.
Many architects in companies ranging from small startups to publicly traded companies are turning to event-driven architectures to solve mission-critical scalability problems, often ones that carry real-time processing requirements. In this talk we'll demonstrate how you can use Apache Cassandra to build powerful event-driven systems in combination with technologies like Akka, RabbitMQ, and others. These concepts will help you radically simplify the design of complex systems and give you the ability to remain available and responsive even in the face of bursty workloads. If your organization does any sort of stream processing or real-time aggregation work with C*, then this talk is for you.
Cassandra Summit 2014: Interactive OLAP Queries using Apache Cassandra and SparkDataStax Academy
Presenter: Evan Chan, Principal Software Engineer at Socrata Inc.
How do you rapidly derive complex insights on top of really big data sets in Cassandra? This session draws upon Evan's experience building a distributed, interactive, columnar query engine on top of Cassandra and Spark. We will start by surveying the existing query landscape of Cassandra and discuss ways to integrate Cassandra and Spark. We will dive into the design and architecture of a fast, column-oriented query architecture for Spark, and why columnar stores are so advantageous for OLAP workloads. I will present a schema for Parquet-like storage of analytical datasets on Cassandra. Find out why Cassandra and Spark are the perfect match for enabling fast, scalable, complex querying and storage of big analytical data.
At Signal we've been running Apache Cassandra in production since late 2011. We use a multi-region Cassandra deployment to make our data available globally to our customers. While Cassandra does much of the heavy lifting for us, we've run into interesting challenges during periods of rapid growth. In this presentation we'll focus on one of those scenarios, including our before and after data model, methodology and tools we used to recover and lessons learned along the way.
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
This talk covers scaling Cassandra to a fast growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the PlayStation community.
Assume you have a Cassandra cluster with hundreds of tables, and one day the latency of client requests and the CPU utilization of the Cassandra process become unacceptable. Our team regularly faced this problem. A simple look at the metrics did not help us: because of high CPU utilization, we saw bad metrics across almost all of our tables. In this talk I will discuss ways to find out which table (or tables) is the most problematic for the cluster and creates problems for other tables. The talk will be fully practical, as I will walk through the real steps of our investigation. Some of the steps were successful, others were not. But in the end we reduced both latency and CPU utilization by a factor of about 10 without adding nodes or hardware resources.
Timeli: Believing Cassandra: Our Big-Data Journey To Enlightenment under the ...DataStax Academy
Thursday, September 24
1:50 PM - 2:30 PM
Ballroom G
Believing Cassandra: Our Big-Data Journey To Enlightenment under the Cassandra Paradigm
It turns out that much can be learned about Cassandra in a year's time, given a high enough pain tolerance on the part of an organization's founders. Join us as we at Timeli.io step you through exactly what happened when we walked into an in-production time series implementation that somehow could not return data in a time series format to its existing customers. We will then discuss how we then re-worked that same implementation to be fully functional, and how we started on the road to finding the keys to Cassandra's legendary performance capabilities in a Zookeeper/AMQP/SQL/Tomcat stack. The path for this journey left no block unstumbled, so if your organization still has toes that are left unbruised this talk could well save you pain.
One is the loneliest number
Much, much worse than two
Many of PagerDuty’s mission-critical services are based on Cassandra, and as a result we have built up a lot of operational experience over the past few years. Unfortunately, some of our best learnings have come from sizeable failures in production. One of those failures stemmed from having multiple services share the same Cassandra cluster, which was a major factor in PagerDuty’s largest outage of 2014. This talk will relive that outage, sort through the wreckage, and explain why isolating your Cassandra clusters is a best practice you should adopt
Cassandra Day Atlanta 2015: Data Modeling In-Depth: A Time Series ExampleDataStax Academy
Take a deep dive into understanding best practices for Cassandra data modeling, with a review of a time series data modeling example. Partition key selection, data duplication, in-place aggregation, as well as using TTLs and DateTieredCompaction to positive effect will all be covered.
C* Summit EU 2013: The Cassandra Experience at Orange DataStax Academy
Speaker: Jean Armel Luce — Senior Software Engineer/Cassandra Admin at Orange
Video: http://www.youtube.com/watch?v=mefOE9K7sLI&list=PLqcm6qE9lgKLoYaakl3YwIWP4hmGsHm5e&index=28
At Orange, Jean Armel has helped develop an open source tool for the migration of data to Cassandra; Jean and his team were in need of the NoSQL solution Apache Cassandra in order to sustain the growth of requests and volume of data required by their application PnS. In this session, Jean Armel will start out with an overview of the Orange application PnS and dive into why they chose Apache Cassandra and how they did their data migration without any interruption of service. Jean Armel will also show how his application behaves after the migration.
How to Store and Visualize CAN Bus Telematic Data with InfluxDB Cloud and Gra...InfluxData
CSS Electronics develop & manufacture professional-grade, simple-to-use CAN bus data loggers. Their plug-and-play CANedge2 logger records time-stamped raw CAN data to an extractable industrial SD card — and connects via WiFi/3G/4G access points to upload the data to the end user’s own servers. The CANedge2 is ideal for collecting automotive sensor metrics like speed, temperatures, state of charge, GPS and more. Learn how to create your own telematics dashboard built on InfluxDB in minutes by attending this webinar!
Join us as Martin Falch dives into:
CSS Electronics’ approach to improving R&D field testing, diagnostics, fleet management and predictive maintenance
The CANedge’s methodology for collecting IIoT data from cars, trucks and machines
How they process time series data from S3 via Python to store it in InfluxDB and visualize it with Grafana
Moolle fan-out control for scalable distributed data storesSungJu Cho
Many Online Social Networks horizontally partition data across data stores. This allows the addition of server nodes to increase capacity and throughput. For single key lookup queries such as computing a member's 1st degree connections, clients need to generate only one request to one data store. However, for multi key lookup queries such as computing a 2nd degree network, clients need to generate multiple requests to multiple data stores. The number of requests to fulfill the multi key lookup queries grows in relation to the number of partitions. Increasing the number of server nodes in order to increase capacity also increases the number of requests between the client and data stores. This may increase the latency of the query response time because of network congestion, tail-latency, and CPU bounding. Replication based partitioning strategies can reduce the number of requests in the multi key lookup queries. However, reducing the number of requests in a query can degrade the performance of certain queries where processing, computing, and filtering can be done by the data stores. A better system would provide the capability of controlling the number of requests in a query. This paper presents Moolle, a system of controlling the number of requests in queries to scalable distributed data stores. Moolle has been implemented in the LinkedIn distributed graph service that serves hundreds of thousands of social graph traversal queries per second. We believe that Moolle can be applied to other distributed systems that handle distributed data processing with a high volume of variable-sized requests.
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2023/07/efficiently-map-ai-and-vision-applications-onto-multi-core-ai-processors-using-cevas-parallel-processing-framework-a-presentation-from-ceva/
Rami Drucker, Machine Learning Software Architect at CEVA, presents the “Efficiently Map AI and Vision Applications onto Multi-core AI Processors Using CEVA’s Parallel Processing Framework” tutorial at the May 2023 Embedded Vision Summit.
Next-generation AI and computer vision applications for autonomous vehicles, cameras, drones and robots require higher-than-ever computing power. Often, the most efficient way to deliver high performance (especially in cost- and power-constrained applications) is to use multi-core processors. But developers must then map their applications onto the multiple cores in an efficient manner, which can be difficult. To address this challenge and streamline application development, CEVA has introduced the Architecture Planner tool as a new element in CEVA’s comprehensive AI SDK.
In this talk, Drucker shows how the Architecture Planner tool analyzes the network model and the processor configuration (number of cores, memory sizes), then automatically maps the workload onto the multiple cores in an efficient manner. He explains key techniques used by the tool, including symmetrical and asymmetrical multi-processing, partition by sub-graphs, batch partitioning and pipeline partitioning.
Introduction to Programmable Networks by Clarence Anslem, IntelMyNOG
Network devices like switches or routers are most commonly designed bottom-up. The switch vendors that offer products to their clients usually rely on external chips from 3rd party silicon vendors. The chip is the heart of the system and in practice determines how the device OS is realized and what functionality it can offer. Since the chip is a fixed-function unit and its internal packet processing pipeline cannot be easily reconfigured at runtime, adding a new feature set is a complex process that may take months. This is because a chip redesign is usually required. P4 and programmable ASICs aim to break these barriers and enable innovation on networking devices, similar to CPUs, GPUs and DSPs in the computing ecosystem.
European Utility Week (2/2) | Paris - 12 au 14 novembre 2019Cluster TWEED
The TWEED and Flux50 clusters brought 13 Belgian companies (Walloon, Brussels and Flemish), gathered under the Belgian colours, to the global trade show European Utility Week in mid-November!
Speakers: Comsof, FifthPlay, GreenWatch, Gorilla, Niko, Option, Powerdale, WeSmart.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2020/12/trends-in-neural-network-topologies-for-vision-at-the-edge-a-presentation-from-synopsys/
For more information about edge AI and computer vision, please visit:
https://www.edge-ai-vision.com
Pierre Paulin, Director of R&D for Embedded Vision at Synopsys, presents the “Trends in Neural Network Topologies for Vision at the Edge” tutorial at the September 2020 Embedded Vision Summit.
The widespread adoption of deep neural networks (DNNs) in embedded vision applications has increased the importance of creating DNN topologies that maximize accuracy while minimizing computation and memory requirements. This has led to accelerated innovation in DNN topologies.
In this talk, Paulin summarizes the key trends in neural network topologies for embedded vision applications, highlighting techniques employed by widely used networks such as EfficientNet and MobileNet to boost both accuracy and efficiency. He also touches on other optimization methods—such as pruning, compression and layer fusion—that developers can use to further reduce the memory and computation demands of modern DNNs.
Autonomous driving requires safety considerations and the need of “fail operational” requires redundancy. In the networking portion of a car, this may mean separate networks, possibly of different technologies. Or it could mean a network topology and technology that supports scalable redundancy, like Ethernet TSN.
This presentation focuses on IEEE 802.1CB-2017, which is the TSN standard that supports data redundancy through the network. Various network topologies are examined. The relative costs of adding TSN redundancy for these topologies (including some, or all of, the end-stations/ECUs & bridges) are examined for various bandwidth utilizations, along with the expected packet loss. Each topology and bandwidth will be modeled under various bit-rate error values with the results discussed.
This presentation aims at providing a clear understanding of the TSN standards that support redundancy, and an understanding of the cost/benefit tradeoffs so proper engineering decisions can be made and proper expectations set.
Similar to Cassandra Summit 2014: The Cassandra Experience at Orange — Season 2 (20)
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
Companies today are innovating with real-time data to deliver truly amazing customer experiences in the moment. Real-time data management for real-time customer experience is core to staying ahead of competition and driving revenue growth. Join Trays to learn how Comcast is differentiating itself from its own historical reputation with Customer Experience strategies.
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
DataStax Enterprise (DSE) Graph is built to manage, analyze, and search highly connected data. DSE Graph, built on NoSQL Apache Cassandra, delivers continuous uptime along with predictable performance and scales for modern systems dealing with complex and constantly changing data.
Download DataStax Enterprise: Academy.DataStax.com/Download
Start free training for DataStax Enterprise Graph: Academy.DataStax.com/courses/ds332-datastax-enterprise-graph
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
DataStax Enterprise Advanced Replication supports one-way distributed data replication from remote database clusters that might experience periods of network or internet downtime. Benefiting use cases that require a 'hub and spoke' architecture.
Learn more at http://www.datastax.com/2016/07/stay-100-connected-with-dse-advanced-replication
Advanced Replication docs – https://docs.datastax.com/en/latest-dse/datastax_enterprise/advRep/advRepTOC.html
Data Modeling is one of the first things to sink your teeth into when trying out a new database. That's why we are going to cover this foundational topic in enough detail for you to get dangerous. Data Modeling for relational databases is more than a touch different from the way it's approached with Cassandra. We will address the quintessential query-driven methodology through a couple of different use cases, including working with time series data for IoT. We will also demo a new tool to get you bootstrapped quickly with MovieLens sample data. This talk should give you the basics you need to get serious with Apache Cassandra.
Hear about how Coursera uses Cassandra as the core of its scalable online education platform. I'll discuss the strengths of Cassandra that we leverage, as well as some limitations that you might run into as well in practice.
In the second part of this talk, we'll dive into how best to use the DataStax Java drivers. We'll dig into how the driver is architected, and use this understanding to develop best practices to follow. I'll also share a couple of interesting bugs we've run into at Coursera.
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
This talk covers scaling Cassandra to a fast growing user base. Alex and Isaias will cover new best practices and how to work with the strengths and weaknesses of Cassandra at large scale. They will discuss how to adapt to bottlenecks while providing a rich feature set to the PlayStation community.
This is a two part talk in which we'll go over the architecture that enables Apache Cassandra’s linear scalability as well as how DataStax Drivers are able to take full advantage of it to provide developers with nicely designed and speedy clients extendable to the core.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I have been wondering, as an “infrastructure container Kubernetes guy”, how this fancy AI technology can be managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to make it work on our own infrastructure from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insight into the approaches I already have working for real.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We ended with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
When stars align: studies in data quality, knowledge graphs, and machine lear...
Cassandra Summit 2014: The Cassandra Experience at Orange — Season 2
1. The C* Experience at Orange, Season 2
Jean Armel Luce, Orange France
Thursday, September 11, 2014
2. 2
The Cassandra experience at Orange - season 1
Summary
1. Why did we choose C* for our application PnS?
2. Our migration strategy (without any interruption of service)
3. After the migration …
§ For more details, watch the Cassandra Summit Europe 2013 talk:
– https://www.youtube.com/watch?v=mefOE9K7sLI&feature=youtube_gdata
Jean Armel Luce - Orange-France
3. 3
The Cassandra experience at Orange - season 2
Summary
1. Short description of the application PnS
2. Keyring: why design customer ids with graphs in C*?
3. BYOHH (Bring Your Own Hadoop & Hive) with Cassandra
5. 5
PnS: Short description of the application
§ PnS means Profile and Syndication: a highly available service for
collecting and serving live data about Orange customers
§ End users:
– Orange customers (www.orange.fr)
– Sellers in Orange shops
– Some services in Orange (advertisements, …)
6. 6
PnS: The Big Picture
[Diagram]
End users → millions of HTTP requests (REST or SOAP) → PNS
PNS: fast and highly available web service to get or set data stored by PnS:
- postProcessing(data1)
- postProcessing(data2)
- postProcessing(data3)
- postProcessing(datax)
- …
Data providers → thousands of files (CSV, JSON or XML) → scheduled data injection → PNS
PNS → DB queries (R/W operations) → Database
7. 7
PnS: Architecture at the end of 2013
2 DCs architecture for high availability: Bagnolet and Sophia Antipolis
§ 1 multi DC cluster
§ and web services (reads and writes)
§ for batch updates
8. 8
PnS: Some key dates about the PnS3.0 project
Season 1 (From April 2013 to October 2013)
Migration to C*
Season 2 part 1 (November 2013)
Keyring
Season 2 part 2 (April 2014)
Hadoop & Hive for Analytics
10. 10
PnS database design
§ Nearly 35 tables at the end of 2013
CREATE TABLE customers (
customer_id varchar,
col1 varchar,
col2 bigint,
col3 set<text>,
...,
coln timestamp,
PRIMARY KEY (customer_id));
§ SELECT colx, coly, colz FROM customers WHERE customer_id = '???' ;
11. 11
Customer ids
§ What is a customer id ?
– cell number
– internet account
– email address
– ISE (internal identifier used by many other Orange applications)
– ....
§ For many reasons, data is stored in tables with different primary keys
– some data are often retrieved using a cell number
→ stored when possible in a table where the PK is a cell number
– … but not all customers have a cell number
→ stored in a table where the PK is not a cell number
– …
12. 12
Customer ids translation
§ A PnS user knows only 1 customer id
§ He often needs to retrieve data indexed by another kind of customer id in the DB
§ Example of customer_id translation: "My cell number is (209) 123-4567" → SELECT * FROM pns WHERE cust_id = 'ISE_QWERTY'
13. 13
Database design in the old relational databases
§ Design with secondary indexes ?
SELECT email_address FROM customer_ids WHERE cell_number = ???;
§ Requires a lot of secondary indexes with values having high cardinality
§ With C*, secondary indexes with values having a high cardinality are wasteful
Columns of customer_ids: ISE (primary key); Cell_number, email_address, intranet account, …, idtypeN (secondary indexes)
15. 15
The new « Customer ids » table in C*
§ Table of edges between customer ids
CREATE TABLE graph(
idvalue1 text, -- value of the initial vertex of the arc
idtype1 text, -- type of the initial vertex of the arc
idvalue2 text, -- value of the terminal vertex of the arc
idtype2 text, -- type of the terminal vertex of the arc
attr map<text, text>, -- a column of map type for storing any kind of property
t timestamp,
PRIMARY KEY ((idvalue1), idtype1, idtype2, idvalue2)
);
SELECT * FROM graph WHERE idvalue1 IN ('???');
16. 16
Small independent graphs
§ 500,000,000 edges in the graph
§ The keyring graph is not a single large graph
§ It is rather a lot of small independent undirected graphs
→ Each vertex has a small neighborhood.
→ The search for a customer id is limited to a small subset of the edges and vertices
17. 17
Atomicity
§ The edges are bi-directional (undirected)
– We need to insert or update 2 rows for each edge
– The atomic batch mode guarantees that the 2 directions are updated atomically
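The two-rows-per-edge rule can be sketched as follows. This is only an illustration, not the PnS code: the helper name `edge_batch` and the id type labels `cell` and `ise` are hypothetical, and the column names are taken from the `graph` table shown earlier. The function only builds the CQL text and its bind values; in C* 1.2 a plain `BEGIN BATCH` is a logged (atomic) batch.

```python
def edge_batch(id1, type1, id2, type2):
    """Build one logged CQL batch writing both directions of an undirected edge.

    Returns the statement text (with ? bind markers) and the bind values,
    so that a driver can prepare and execute it. Nothing is sent to a cluster.
    """
    insert = ("INSERT INTO graph (idvalue1, idtype1, idtype2, idvalue2) "
              "VALUES (?, ?, ?, ?);")
    cql = "\n".join(["BEGIN BATCH", "  " + insert, "  " + insert, "APPLY BATCH;"])
    params = [id1, type1, type2, id2,   # row for the direction id1 -> id2
              id2, type2, type1, id1]   # row for the direction id2 -> id1
    return cql, params


if __name__ == "__main__":
    statement, values = edge_batch("2091234567", "cell", "ISE_QWERTY", "ise")
    print(statement)
    print(values)
```

Using bind markers rather than string concatenation keeps customer ids out of the statement text, which is the usual way to pass values through a driver.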
18. 18
Optimization of the search of the shortest path
§ We know which kinds of customer ids are used by the PnS users
§ We know which kinds of customer ids are used for indexing
§ For each pair, the shortest paths are predefined in our application PnS (according to the kind of customer ids)
19. 19
Search API in the graph
§ An in-house C++ library offers an API for an iterative breadth-first graph exploration
§ Example: looking for H from A
[Diagram: graph with vertices A, B, C, D, E, F, G, H, I]
SELECT * FROM graph WHERE idvalue1 IN ('B', 'F');
20. 20
Nb queries per search
§ Looking for a direct neighbour requires only 1 SELECT
§ Looking for a neighbour of a neighbour requires 2 SELECTs
§ Looking for a neighbour of a neighbour of a neighbour requires 3 SELECTs
§ …
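The relation between search depth and number of SELECTs can be illustrated with a small sketch. It is written in Python rather than the in-house C++ library, which is not shown in the slides; the in-memory dictionary, the function names and the sample edges are all assumptions standing in for the `graph` table and for `SELECT * FROM graph WHERE idvalue1 IN (...)`.

```python
def undirected(pairs):
    """Store each edge in both directions, as the keyring table does."""
    edges = {}
    for a, b in pairs:
        edges.setdefault(a, []).append(b)
        edges.setdefault(b, []).append(a)
    return edges


def select_in(edges, idvalues):
    """Stand-in for: SELECT * FROM graph WHERE idvalue1 IN (idvalues)."""
    return [(src, dst) for src in idvalues for dst in edges.get(src, [])]


def search(edges, start, target):
    """Iterative breadth-first search; returns (found, number_of_selects)."""
    visited = {start}
    frontier = [start]
    selects = 0
    while frontier:
        rows = select_in(edges, frontier)  # one query per BFS level
        selects += 1
        next_frontier = []
        for _, neighbour in rows:
            if neighbour == target:
                return True, selects
            if neighbour not in visited:
                visited.add(neighbour)
                next_frontier.append(neighbour)
        frontier = next_frontier
    return False, selects


if __name__ == "__main__":
    # Hypothetical edges; the real layout of the slide's graph is not known.
    sample = undirected([("A", "B"), ("A", "F"), ("F", "H"), ("B", "C")])
    print(search(sample, "A", "B"))  # direct neighbour
    print(search(sample, "A", "H"))  # neighbour of a neighbour
```

Each level of the exploration groups the whole frontier into a single `IN` query, which is why the number of SELECTs equals the distance between the two ids.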
21. 21
Search Response time
Number of searches/sec: nearly 700
Response time per search: 2 ms < RTT < 3.5 ms
§ A search is executed using 1, 2 or 3 reads → very low response time (thanks to FusionIO and C++ code)
22. 22
Conclusions about Keyring
§ We had to rethink this feature, because C* != RDBMS
§ At first glance, a graph looks like an exotic design … but for our use case, it works well with C* … and FusionIO.
§ Favoring the access to data through the partitioning key is very efficient for getting a low response time and a linear scalability.
24. 24
Basic architecture of the Cassandra cluster
§ Cluster without Hadoop: 2 datacenters, 16 nodes in each DC
§ RF (DC1, DC2) = (3, 3)
§ CL = ONE or LOCAL_QUORUM for online queries
§ Requests from web servers in DC1 are sent to C* nodes in DC1
§ Requests from web servers in DC2 are sent to C* nodes in DC2
[Diagram: the pool of web servers in DC1 sends its requests to DC1; the pool of web servers in DC2 sends its requests to DC2]
25. 25
Adding a new datacenter for analytics
§ Cluster with Hadoop/Hive: 3 datacenters, 16 nodes in DC1, 16 nodes in DC2, 4 nodes in DC3
§ RF (DC1, DC2, DC3) = (3, 3, 1)
§ Because RF = 1 in DC3, we need less storage space in this datacenter
§ We favor cheaper disks (SATA) in DC3 rather than SSDs or FusionIO cards
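The storage effect of RF = (3, 3, 1) can be checked with a few lines of arithmetic. This is a rough sketch under simple assumptions: data is spread evenly over the nodes of a datacenter, and the logical data size is expressed as an arbitrary unit of 1.0 rather than a measured volume. The function names are hypothetical.

```python
def copies_per_dc(logical_size, rf_by_dc):
    """Total data stored in each datacenter: RF full copies of the data set."""
    return {dc: rf * logical_size for dc, rf in rf_by_dc.items()}


def average_per_node(logical_size, rf_by_dc, nodes_by_dc):
    """Average data per node, assuming an even spread inside each datacenter."""
    stored = copies_per_dc(logical_size, rf_by_dc)
    return {dc: stored[dc] / nodes_by_dc[dc] for dc in stored}


if __name__ == "__main__":
    rf = {"DC1": 3, "DC2": 3, "DC3": 1}
    nodes = {"DC1": 16, "DC2": 16, "DC3": 4}
    print(copies_per_dc(1.0, rf))            # whole-datacenter totals
    print(average_per_node(1.0, rf, nodes))  # per-node averages
```

DC3 as a whole holds one copy instead of three, so a third of the space of DC1 or DC2. The per-node figure depends on the node count as well, which this sketch makes visible.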
26. 26
Architecture of the Cassandra cluster with the new datacenter for analytics
[Diagram: DC1 and DC2, each with its pool of web servers, and the new DC3]
27. 27
Potential impacts of map reduce tasks for online queries
§ Map reduce tasks take all the resources (CPU, RAM, IO, …)
§ Timeouts due to CL=ONE used for online READ queries
§ Hinted Handoffs (HH) for online update queries not replicated in DC3
[Diagram: DC1, DC2 and DC3 with the pools of web servers of DC1 and DC2; the nodes are annotated with "Timeouts" and "HH"]
28. 28
Isolation between online queries and map reduce tasks:
Solution for timeouts (online READ queries)
§ Consistency levels: ANY, LOCAL_ONE, ONE, LOCAL_QUORUM, EACH_QUORUM, QUORUM, ALL
§ Use a LOCAL CONSISTENCY LEVEL:
– For map reduce tasks in DC3:
  – LOCAL_ONE
– For online queries in DC1 or DC2:
  – LOCAL_ONE
  – LOCAL_QUORUM
§ LOCAL_ONE is available since C* 1.2.12 (cf. JIRA CASSANDRA-6238)
§ Problem addressed: timeouts due to CL=ONE used for online READ queries
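To see why the LOCAL levels isolate the datacenters, it helps to count how many replicas each consistency level waits for with RF (DC1, DC2, DC3) = (3, 3, 1). The sketch below encodes the standard definitions (a quorum is floor(RF / 2) + 1); the function names and the "scope" labels are hypothetical, and ANY is left out because a hint alone can satisfy it.

```python
def quorum(rf):
    """Majority of rf replicas."""
    return rf // 2 + 1


def replicas_required(cl, rf_by_dc, local_dc):
    """Return (number of replicas to wait for, where they may be located)."""
    total = sum(rf_by_dc.values())
    if cl == "ONE":
        return 1, "any DC"
    if cl == "LOCAL_ONE":
        return 1, "local DC"
    if cl == "QUORUM":
        return quorum(total), "any DC"
    if cl == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc]), "local DC"
    if cl == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_by_dc.values()), "every DC"
    if cl == "ALL":
        return total, "every DC"
    raise ValueError("unsupported consistency level: " + cl)


if __name__ == "__main__":
    rf = {"DC1": 3, "DC2": 3, "DC3": 1}
    for level in ("ONE", "LOCAL_ONE", "QUORUM", "LOCAL_QUORUM", "EACH_QUORUM", "ALL"):
        print(level, replicas_required(level, rf, "DC1"))
```

Only LOCAL_ONE and LOCAL_QUORUM are guaranteed to be answered by replicas of the coordinator's own datacenter, so an overloaded DC3 cannot be on the critical path of an online query sent to DC1 or DC2.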
29. 29
Solution for Hinted Handoffs (online WRITE queries) 1/2
Guarantee on resources for online queries
§ Use CGROUPS:
– Can guarantee a minimum of CPU/RAM for online queries
– Cgroups cannot be used for disk I/O (map tasks call C* processes when reading data on disk)
§ Problem addressed: Hinted Handoffs for online update queries not replicated in DC3
30. 30
Solution for Hinted Handoffs (online WRITE queries) 2/2
Swap global and local read repair chances
§ By default, in C* 1.2:
– read_repair_chance = 0.1
– dclocal_read_repair_chance = 0.0
§ For highly read tables, so that the read repairs are not sent to DC3:
– Set read_repair_chance = 0.00
– Set dclocal_read_repair_chance = 0.1
→ Less load and disk I/O in DC3
§ DCLOCAL_READ_REPAIR_CHANCE=0.1 is the default since C* 2.0.9 (cf. JIRA CASSANDRA-7320)
§ Problem addressed: Hinted Handoffs for online update queries not replicated in DC3
[Diagram: DC1, DC2 and DC3]
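The swap of the two read repair chances is a per-table setting. As a minimal sketch, the helper below only builds the corresponding CQL statement for the Cassandra versions discussed here (1.2 and 2.0); the function name is hypothetical and the table name `customers` is borrowed from the database design slide purely as an example.

```python
def swap_read_repair(table, dclocal_chance=0.1):
    """Build the ALTER TABLE statement that disables global read repair
    and enables datacenter-local read repair for one table."""
    return ("ALTER TABLE {} WITH read_repair_chance = 0.0 "
            "AND dclocal_read_repair_chance = {};").format(table, dclocal_chance)


if __name__ == "__main__":
    print(swap_read_repair("customers"))
```

The statement would then be run once per highly read table, for instance through cqlsh.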
31. 31
Tradeoff “ease of operations vs optimization”
§ 256 vnodes per C* node is usually recommended
§ At least 1 map task per virtual node in DC3
– Disabling virtual nodes in DC3:
  – adding new nodes in DC3 is less easy
  – shortens the execution time
– Enabling virtual nodes in DC3:
  – adding new nodes in DC3 is easier, but the execution time is longer
What is the right number of vnodes? 64 vnodes/node looks good.
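The arithmetic behind this tradeoff, following the slide's rule of at least one map task per virtual node and the 4 DC3 nodes listed in the hardware answers later in the deck. The function name is hypothetical, and the counts are lower bounds derived from that rule, not measurements.

```python
def min_map_tasks(nodes, vnodes_per_node):
    """Lower bound on map tasks: one per virtual node."""
    return nodes * vnodes_per_node


DC3_NODES = 4  # number of DC3 nodes given in the hardware answers

for vnodes in (1, 64, 256):  # 1 stands for vnodes disabled
    tasks = min_map_tasks(DC3_NODES, vnodes)
    print(f"{vnodes:3d} vnodes/node -> at least {tasks} map tasks")
```

Each map task carries a fixed startup cost, which is one plausible reason why fewer vnodes shorten the execution time.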
32. 32
Contributions and open sourced modules
§ Hive Handler open sourced by Orange
§ Works with CDH4.4 and C* 1.2.13
§ Feature added to this handler: authentication
§ Github: https://github.com/Orange-OpenSource/cassandra_handler_for_hive
Thanks to Cyril Scetbon for this handler
33. 33
Conclusions about BYOHH
§ The installation of Hadoop & Hive is tricky, but we had no choice for analytics because CQL has many limitations.
§ We had to rethink our architecture. Now, we are able to do analytics with Hadoop + Hive with better isolation between online and analytics queries.
§ We have also discovered an interesting ecosystem around C* which offers more capabilities. With this ecosystem, we can benefit from the strengths of C* and work around some of its limitations.
38. 38
A few answers about hardware/OS version/Java version/Cassandra version/Hadoop version
§ Hardware:
  § 16 nodes in DC1 and DC2 at the end of 2013:
    § 2 CPUs, 6 cores each, Intel® Xeon® 2.00 GHz
    § 64 GB RAM
    § FusionIO® 800 GB MLC
  § 4 nodes in DC3:
    § 2 CPUs, 6 cores each, Intel® Xeon® 2.00 GHz
    § 24 GB RAM
    § SATA disks, 15K rpm
§ OS: Ubuntu Precise (12.04 LTS)
§ Cassandra version: 1.2.13
§ Hadoop version: CDH 4.4 (with Hive 0.10): Hadoop 2 with MRv1
§ Hive handler: https://github.com/Orange-OpenSource/cassandra_handler_for_hive
§ Java version: Java7u45 (GC CMS with option CMSClassUnloadingEnabled)
39. 39
A few answers about data and requests
§ Data:
  § Volume: 6 TB at the end of 2013
  § Elementary types: boolean, integer, string, date
  § Collection types
  § Complex types: JSON, XML (between 1 and 20 KB)
§ Requests:
  § 10,000 requests/sec at the end of 2013
  § 80% get
  § 20% set
§ Consistency level used by PnS for online queries and batch updates:
  § LOCAL_ONE (95% of the queries)
  § LOCAL_QUORUM (5% of the queries)