Apache Cassandra is a distributed, masterless column-store database that is becoming mainstream for analytics and IoT data.
https://www.bigdataspain.org/2017/talk/tuning-java-driver-for-apache-cassandra
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
End of the Myth: Ultra-Scalable Transactional Management by Ricardo Jiménez-P... | Big Data Spain
The talk will focus on explaining why operational databases do not scale due to limitations in legacy transactional management.
https://www.bigdataspain.org/2017/talk/end-of-the-myth-ultra-scalable-transactional-management
Big Data Spain 2017
November 16th - 17th Kinépolis Madrid
Capital One: Using Cassandra In Building A Reporting Platform | DataStax Academy
As a leader in the financial industry, Capital One applications generate huge amounts of data that require fast and accurate handling, storage and analysis. We are transforming how we report operational data to our internal users so that they can make quick and precise business decisions to serve our customers. As part of this transformation, we are building a new Go-based data processing framework that will enable us to transfer data from multiple data stores (RDBMS, files, etc.) to a single NoSQL database - Cassandra. This new NoSQL store will act as a reporting database that will receive data on a near real-time basis and serve the data through scorecards and reports. We would like to share our experience in defining this fast data platform and the methodologies used to model financial data in Cassandra.
Proofpoint: Fraud Detection and Security on Social Media | DataStax Academy
Social media has become the new frontier for cyber-attackers. The explosive growth of this new communications platform, combined with the potential to reach millions of people through a single post, has provided a low barrier for exploitation. In this talk, we will focus on how Cassandra is used to enable our fight against bad actors on social media. In particular, we will discuss how we use Cassandra for anomaly detection, social mob alerting, trending topics, and fraudulent classification. We will also speak about our Cassandra data models, integration with Spark Streaming, and how we use KairosDB for our time series data. Watch us don our superhero-Cassandra capes as we fight against the bad guys!
Using Apache Spark to analyze large datasets in the cloud presents a range of challenges. Different stages of your pipeline may be constrained by CPU, memory, disk and/or network IO. But what if all those stages have to run on the same cluster? In the cloud, you have limited control over the hardware your cluster runs on.
You may have even less control over the size and format of your raw input files. Performance tuning is an iterative and experimental process. It’s frustrating with very large datasets: what worked great with 30 billion rows may not work at all with 400 billion rows. But with strategic optimizations and compromises, 50+ TiB datasets can be no big deal.
By using Spark UI and simple metrics, explore how to diagnose and remedy issues on jobs:
Sizing the cluster based on your dataset (shuffle partitions)
Ingestion challenges – well begun is half done (globbing S3, small files)
Managing memory (sorting GC – when to go parallel, when to go G1, when offheap can help you)
Shuffle (give a little to get a lot – configs for better out of box shuffle) – Spill (partitioning for the win)
Scheduling (FAIR vs FIFO, is there a difference for your pipeline?)
Caching and persistence (it’s the cost of doing business, so what are your options?)
Fault tolerance (blacklisting, speculation, task reaping)
Making the best of a bad deal (skew joins, windowing, UDFs, very large query plans)
Writing to S3 (dealing with write partitions, HDFS and s3DistCp vs writing directly to S3)
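The first item in the list above (sizing shuffle partitions to the dataset) comes down to simple arithmetic. The following is a minimal sketch, assuming a target of roughly 128 MiB per shuffle partition and rounding up to a whole number of task waves across the cluster's cores. Both the target size and the helper name `suggest_shuffle_partitions` are illustrative assumptions, not figures or code from the talk. The result is the kind of value one might try for Spark's `spark.sql.shuffle.partitions` setting before measuring and iterating.

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def suggest_shuffle_partitions(shuffle_bytes, total_cores,
                               target_partition_bytes=128 * MIB):
    """Hypothetical helper: estimate a starting shuffle partition count.

    Assumption (not from the talk): aim for ~128 MiB per partition and
    round up so that every wave of tasks keeps all cores busy.
    """
    if shuffle_bytes <= 0 or total_cores <= 0 or target_partition_bytes <= 0:
        raise ValueError("all arguments must be positive")

    # Ceiling division with integers avoids floating-point rounding.
    by_size = -(-shuffle_bytes // target_partition_bytes)

    # Round up to a multiple of the core count (whole task waves).
    waves = -(-by_size // total_cores)
    return waves * total_cores


if __name__ == "__main__":
    # A 50 TiB shuffle on a cluster with 1000 cores.
    print(suggest_shuffle_partitions(50 * TIB, 1000))
```

This is only a starting point: skewed keys, compression, and spill behaviour can all shift the right number, which is why the abstract describes tuning as iterative and experimental.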
Real time analytics with Kafka and SparkStreaming | Ashish Singh
In a world where every “thing” is producing lots of data, ingesting and processing that large volume of data becomes a big problem. In today’s dynamic world, firms have to react to changing conditions very fast, or even better in real time. This presentation covers how two of the latest and greatest tools from the Big Data community, Kafka and Spark Streaming, enable us to take on that challenge.
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge_David | StreamNative
The business value of data decreases rapidly after it is created, particularly in use cases such as fraud prevention, cybersecurity, and real-time system monitoring. The high-volume, high-velocity datasets used to feed these use cases often contain valuable, but perishable, insights that must be acted upon immediately.
To maximize the value of their data, enterprises must fundamentally change their approach to processing real-time data and focus on reducing their decision latency on the perishable insights that exist within their real-time data streams, thereby enabling the organization to act upon them while the window of opportunity is open.
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East... | Spark Summit
Almost all organizations now have a need for data science, and as such the main challenge after determining the algorithm is to scale it up and make it operational. We at Comcast use several tools and technologies such as Python, R, SAS, H2O and so on.
In this talk we will show how many common use cases use common algorithms like Logistic Regression, Random Forest, Decision Trees, Clustering, NLP, etc.
Spark has several machine learning algorithms built in and has excellent scalability. Hence we at Comcast built a platform to provide DSaaS on top of Spark, with a REST API as the means of controlling and submitting jobs, so that most users are spared the rigor of writing (and repeating) code and can instead focus on the actual requirements. We will show how we solved some of the problems of establishing feature vectors, choosing algorithms and then deploying models into production.
We will showcase our use of Scala, R and Python to implement models in the language of choice while deploying quickly into production on 500-node Spark clusters.
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized | HostedbyConfluent
Enforcing formats, changing schemas, and introducing privacy filters have always been a challenge with the classical Kafka-API. In this talk we'll cover how to extend existing applications with WebAssembly, allowing developers to change the shape of data at runtime, per application, without creating additional topics. By leveraging WebAssembly, we can extend the capabilities of the Kafka-API beyond what was initially imagined. Come and learn about the future of the Kafka-API.
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St... | Spark Summit
Spark data processing is shifting from on-premises to cloud services to take advantage of their horizontal resource scalability, better data accessibility and easy manageability. However, fully utilizing the computational power, fast storage and networking offered by cloud services can be challenging without a deep understanding of workload characteristics and proper software optimization expertise. In this presentation, we will use a Spark-based programming framework – Genome Analysis Toolkit version 4 (GATK4, under development) – as an example to present a process of configuring and optimizing a proficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house developed data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in the application. We will show a case study in which we identify a thread scalability issue with the Java instanceof operator. The fix in the Scala language hugely improves the performance of GATK4 and other Spark-based workloads.
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases | SingleStore
Eric Frenkiel, MemSQL CEO and co-founder and Gartner Catalyst. August 11, 2015, San Diego, CA. Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ... | confluent
Eventing and streaming open a world of compelling new possibilities to our software and platform designs. They can reduce time to decision and action while lowering total platform cost. But they are not a panacea. Understanding the edges and limits of these architectures can help you avoid painful missteps. This talk will focus on event-driven and streaming architectures and how Apache Kafka can help you implement these. It will also discuss key tradeoffs you will face along the way from partitioning schemes to the impact of availability vs. consistency (CAP Theorem). Finally, we’ll discuss some challenges of scale for patterns like Event Sourcing and how you can use other tools and even features of Kafka to work around them. This talk assumes a basic understanding of Kafka and distributed computing but will include brief refresher sections.
Building A Diverse Geo-Architecture For Cloud Native Applications In One Day | VMware Tanzu
Presenter: Ben Laplanche, Product Manager, Pivotal Cloud Foundry
Companies turn to PaaS and Cloud Native Applications to gain agility and speed. To provide customer value, a fault tolerant infrastructure is essential. But what happens if an entire data center, region, or even country should go offline? Cassandra holds the key to keeping application state in sync through replication, whilst Pivotal Cloud Foundry provides easy deployment to multiple IaaS providers. It also comes complete with a managed service offering for DataStax Enterprise. This talk will discuss how this setup can be deployed in one day, including demonstrations and a walkthrough of the key concepts, approaches, and considerations.
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ... | Spark Summit
Come explore a feature we’ve created that is not supported out-of-the-box: the ability to add or remove nodes to always-on real time Spark Streaming jobs. Elastic Spark Streaming jobs can automatically adjust to the demands of traffic or volume. Using a set of configurable utility classes, these jobs scale down when lulls are detected and scale up when load is too high. We process multiple TBs per day with billions of events. Our traffic pattern experiences natural peaks and valleys with the occasional sustained unexpected spike. Elastic jobs have freed us from manual intervention, given back developer time, and made a large financial impact through maximized resource utilization.
We hear a lot about lambda architectures and how Cassandra and Spark can help us crunch our data both in batch and real-time. After a year in the trenches, I'll share how we at The Weather Company built a general purpose, weather-scale event processing pipeline to make sense of billions of events each day. If you want to avoid much of the pain of learning how to get it right, this talk is for you.
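The scale-down-on-lulls, scale-up-on-load behaviour described above can be sketched as a small decision function. This is a minimal illustration only, not the speakers' utility classes: the function name `decide_executors`, the thresholds, and the choice of load signal (batch processing time divided by the batch interval) are all assumptions made for the example.

```python
def decide_executors(current, processing_secs, batch_interval_secs,
                     min_executors=2, max_executors=50,
                     scale_up_ratio=0.9, scale_down_ratio=0.5):
    """Hypothetical scaling rule for a streaming job.

    Assumed load signal: the fraction of each batch interval spent
    processing. Above scale_up_ratio the job is close to falling behind,
    so add one executor; below scale_down_ratio it is idling, so remove
    one. The gap between the two thresholds avoids flapping.
    """
    if batch_interval_secs <= 0:
        raise ValueError("batch_interval_secs must be positive")

    ratio = processing_secs / batch_interval_secs

    if ratio > scale_up_ratio:
        target = current + 1      # load too high: scale up
    elif ratio < scale_down_ratio:
        target = current - 1      # lull detected: scale down
    else:
        target = current          # within the comfortable band

    # Never leave the configured bounds.
    return max(min_executors, min(max_executors, target))


if __name__ == "__main__":
    # 10-second batches taking 9.5 seconds to process: add an executor.
    print(decide_executors(current=10, processing_secs=9.5,
                           batch_interval_secs=10))
```

Changing one executor at a time keeps the sketch conservative; a real job facing the sustained spikes the abstract mentions might step by more than one.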
Tsinghua University: Two Exemplary Applications in China | DataStax Academy
In this talk, we will share our experience of applying Cassandra with two real customers in China. In the first use case, we deployed Cassandra at Sany Group, a leading machinery manufacturer, to manage the sensor data generated by construction machinery. By designing a specific schema and optimizing the write process, we successfully managed over 1.5 billion historical data records and achieved an online write throughput of 10k write operations per second with 5 servers. MapReduce is also used on Cassandra for value-added services, e.g. operations management, machine failure prediction, and abnormal behavior mining. In the second use case, Cassandra is deployed at the China Meteorological Administration to manage meteorological data. We designed a hybrid schema to support both slice queries and time-window-based queries efficiently. We also explored optimized compaction and deletion strategies for meteorological data in this case.
SMACK is a combination of Spark, Mesos, Akka, Cassandra and Kafka. It is used for pipelined data architectures, which are required for real-time data analysis, and to integrate each technology at the right place to build an efficient data pipeline.
In this talk Josep draws on his experience of building a data platform based on Cassandra and Spark to service the UK's foremost player in the connected homes market: bringing streams of data online, productionising data science algorithms on Spark, and delivering outputs via APIs or Kafka messages.
Josep will explore the ups and the downs of bringing all this together and share what he's learned from 12 months of Cassandra and Spark development and operations.
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta... | Spark Summit
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.
In this talk, we present Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluated Clipper on four common machine learning benchmark datasets and demonstrated its ability to meet the latency, accuracy, and throughput demands of online serving applications. We also compared Clipper to the TensorFlow Serving system and demonstrated comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.
The .NET ecosystem spent years on the sidelines, watching the NoSQL and distributed computing movements flourish in ecosystems like Java, Node.js, and others.
Over the past year or so, the .NET ecosystem took matters into its own hands and has feverishly started adopting new ideas like NoSQL, reactive programming, the actor model, and more!
In this talk we're going to explore what the modern .NET enterprise stack looks like: Cassandra, Akka.NET, and Windows Azure. Also, we'll share what exciting new possibilities this has been able to create for some of the largest .NET shops in the world.
Monitoring Large-Scale Apache Spark Clusters at Databricks | Anyscale
At Databricks, we manage Apache Spark clusters for customers to run various production workloads. In this talk, we share our experiences in building a real-time monitoring system for thousands of Spark nodes, including the lessons we learned and the value we’ve seen from our efforts so far.
This was part of the talk presented at the #monitorSF Meetup held at Databricks HQ in SF.
High cardinality time series search: A new level of scale - Data Day Texas 2016 | Eric Sammer
Modern search systems provide incredible feature sets, developer-friendly APIs, and low latency indexing and query response. By some measures, these systems operate "at scale," but rarely is that quantified. Customers of Rocana typically look to push ingest rates in excess of 1 million events per second, retaining years of data online for query, with the expectation of sub-second response times for any reasonably sized subset of data.
We quickly found that the tradeoffs made by general purpose search systems, while right for common use cases, were less appropriate for these high cardinality, large scale use cases.
This session details the architecture, tradeoffs, and interesting implementation decisions made in building a new time series optimized distributed search system using Apache Lucene, Kafka, and HDFS. Data ingestion and durability, index and metadata organization, storage, query scheduling and optimization, and failure modes will be covered. Finally, a summary of the results achieved will be shown.
First presentation for Savi's sponsorship of the Washington DC Spark Interactive. Discusses tips and lessons learned using Spark Streaming (24x7) to ingest and analyze Industrial Internet of Things (IIoT) data as part of a Lambda Architecture.
Tuning Java Driver for Apache Cassandra | Nenad Bozic
Apache Cassandra is a distributed, masterless column-store database that is becoming mainstream for analytics and IoT data. Many use cases where Cassandra is a natural fit require latency tuning in order to serve requests really fast. The DataStax driver has many options, some less familiar, which can greatly influence performance. This talk will focus on Java applications and the options at your disposal in the DataStax Java driver, which has become the standard way to use this database. We will concentrate on both monitoring and tuning, and we will provide different options for different use cases. There is no silver bullet, and having many options requires a deep dive when you want to figure out the right decision. This talk will narrow down the options and give you a push in the right direction.
NoSQL – Data Center Centric Application Enablement | DATAVERSITY
The growth of data center infrastructure is trending out of bounds, along with the pace of user activity and data generation in this digital era. However, the nature of the typical application deployment within the data center is changing to accommodate new business needs. Those changes introduce complexities in application deployment architecture and design, which cascade into requirements for a new generation of database technology (NoSQL) destined to ease that complexity. This webcast will discuss the modern data center's data-centric application, the complexities that must be dealt with, and common architectures found to describe and prescribe new data center aware services. We'll look at the practical issues in implementation and give an overview of the current state of the art in NoSQL database technology for solving the problems of data center awareness in application development.
Using SparkML to Power a DSaaS (Data Science as a Service): Spark Summit East...Spark Summit
Almost all organizations now have a need for datascience and as such the main challenge after determining the algorithm is to scale it up and make it operational. We at comcast use several tools and technologies such as Python, R, SaS, H2O and so on.
In this talk we will show how many common use cases use the common algorithms like Logistic Regression, Random Forest, Decision Trees , Clustering, NLP etc.
Spark has several Machine Learning algorithms built in and has excellent scalability. Hence we at comcast built a platform to provide DSaaS on top of Spark with REST API as a means of controlling and submitting jobs so as to abstract most users from the rigor of writing(repeating ) code instead focusing on the actual requirements. We will show how we solved some of the problems of establishing feature vectors, choosing algorithms and then deploying models into production.
We will showcase our use of Scala, R and Python to implement models using language of choice yet deploying quickly into production on 500 node Spark clusters.
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, VectorizedHostedbyConfluent
Enforcing format, changing schema, introducing privacy filters have always been a challenge with the classical Kafka-API. In this talk we'll cover how to extend existing applications with webassembly, allowing developers to change the shape of data at runtime, per application without creating additional topics. By leveraging WebAssembly, we can extend the capabilities of the Kafka-API beyond what it was initially imagined. Come and learn about the future of the Kafka-API
Accelerating Spark Genome Sequencing in Cloud—A Data Driven Approach, Case St...Spark Summit
Spark data processing is shifting from on-premises to cloud service to take advantage of its horizontal resource scalability, better data accessibility and easy manageability. However, fully utilizing the computational power, fast storage and networking offered by cloud service can be challenging without deep understanding of workload characterizations and proper software optimization expertise. In this presentation, we will use a Spark based programing framework – Genome Analysis Toolkit version 4 (GATK4, under development), as an example to present a process of configuring and optimizing a proficient Spark cluster on Google Cloud to speed up genome data processing. We will first introduce an in-house developed data profiling framework named PAT, and discuss how to use PAT to quickly establish the best combination of VM configurations and Spark configurations to fully utilize cloud hardware resources and Spark computational parallelism. In addition, we use PAT and other data profiling tools to identify and fix software hotspots in application. We will show a case study in which we identify a thread scalability issue of Java Instanceof operator. The fix in Scala language hugely improves performance of GATK4 and other Spark based workloads.
Real-Time Data Pipelines with Kafka, Spark, and Operational DatabasesSingleStore
Eric Frenkiel, MemSQL CEO and co-founder and Gartner Catalyst. August 11, 2015, San Diego, CA. Watch the Pinterest Demo Video here: https://youtu.be/KXelkQFVz4E
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) Kafka Summ...confluent
Eventing and streaming open a world of compelling new possibilities to our software and platform designs. They can reduce time to decision and action while lowering total platform cost. But they are not a panacea. Understanding the edges and limits of these architectures can help you avoid painful missteps. This talk will focus on event-driven and streaming architectures and how Apache Kafka can help you implement these. It will also discuss key tradeoffs you will face along the way from partitioning schemes to the impact of availability vs. consistency (CAP Theorem). Finally, we’ll discuss some challenges of scale for patterns like Event Sourcing and how you can use other tools and even features of Kafka to work around them. This talk assumes a basic understanding of Kafka and distributed computing but will include brief refresher sections.
Building A Diverse Geo-Architecture For Cloud Native Applications In One DayVMware Tanzu
Presenter: Ben Laplanche, Product Manager, Pivotal Cloud Foundry
Companies turn to PaaS and Cloud Native Applications to gain agility and speed. To provide customer value, a fault tolerant infrastructure is essential. But what happens if an entire data center, region, or even country should go offline? Cassandra holds the key to keeping application state in sync through replication, whilst Pivotal Cloud Foundry provides easy deployment to multiple IaaS providers. It also comes complete with a managed service offering for DataStax Enterprise. This talk will discuss how this setup can be deployed in one day, including demonstrations and a walkthrough of the key concepts, approaches, and considerations.
Auto Scaling Systems With Elastic Spark Streaming: Spark Summit East talk by ...Spark Summit
Come explore a feature we’ve created that is not supported out-of-the-box: the ability to add or remove nodes to always-on real time Spark Streaming jobs. Elastic Spark Streaming jobs can automatically adjust to the demands of traffic or volume. Using a set of configurable utility classes, these jobs scale down when lulls are detected and scale up when load is too high. We process multiple TB’s per day with billions of events. Our traffic pattern experiences natural peaks and valleys with the occasional sustained unexpected spike. Elastic jobs has freed us from manual intervention, given back developer time, and has made a large financial impact through maximized resource utilization.
We hear a lot about lambda architectures and how Cassandra and Spark can help us crunch our data both in batch and real-time. After a year in the trenches, I'll share how we at The Weather Company built a general purpose, weather-scale event processing pipeline to make sense of billions of events each day. If you want to avoid much of the pain learning how to get it right, this talk is for you.
Tsinghua University: Two Exemplary Applications in ChinaDataStax Academy
In this talk, we will share the experiences of applying Cassandra with two real customers in China. In the first use case, we deployed Cassandra at Sany Group, a leading company of Machinery manufacturing, to manage the sensor data generated by construction machinery. By designing a specific schema and optimizing the write process, we successfully managed over 1.5 billion historical data records and achieved the online write throughput of 10k write operations per second with 5 servers. MapReduce is also used on Cassandra for valued-added services, e.g. operations management, machine failure prediction, and abnormal behavior mining. In the second use case, Cassandra is deployed in the China Meteorological Administration to manage the Meteorological data. We design a hybrid schema to support both slice query and time window based query efficiently. Also, we explored the optimized compaction and deletion strategy for meteorological data in this case.
SMACK is a combination of Spark, Mesos, Akka, Cassandra and Kafka. It is used for pipelined data architecture which is required for the real time data analysis and to integrate all the technology at the right place to efficient data pipeline.
In this talk Josep draws on his experience of building a data platform based on Cassandra and Spark to service the UK's foremost player in the connected homes market. Bringing streams of data online; productionising data science algorithms on spark; and delivering outputs via API's or Kafka messages.
Josep will explore the ups and the downs of bringing all this together and share what he's learned from 12 months of Cassandra and Spark development and operations.
Clipper: A Low-Latency Online Prediction Serving System: Spark Summit East ta...Spark Summit
Machine learning is being deployed in a growing number of applications which demand real-time, accurate, and robust predictions under heavy query load. However, most machine learning frameworks and systems only address model training and not deployment.
In this talk, we present Clipper, a general-purpose low-latency prediction serving system. Interposing between end-user applications and a wide range of machine learning frameworks, Clipper introduces a modular architecture to simplify model deployment across frameworks. Furthermore, by introducing caching, batching, and adaptive model selection techniques, Clipper reduces prediction latency and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks. We evaluated Clipper on four common machine learning benchmark datasets and demonstrate its ability to meet the latency, accuracy, and throughput demands of online serving applications. We also compared Clipper to the Tensorflow Serving system and demonstrate comparable prediction throughput and latency on a range of models while enabling new functionality, improved accuracy, and robustness.
The .NET ecosystem spent years on the sidelines, watching the NoSQL and distributed computing movements flourish in ecosystems like Java, Node.JS, and others.
Over the past year or so, the .NET ecosystem took matters into its own hands and has feverishly started adopting new ideas like NoSQL, reactive programming, the actor model, and more!
In this talk we're going to explore what the modern .NET enterprise stack looks like: Cassandra, Akka.NET, and Windows Azure. Also, we'll share what exciting new possibilities this has been able to create for some of the largest .NET shops in the world.
Monitoring Large-Scale Apache Spark Clusters at DatabricksAnyscale
At Databricks, we manage Apache Spark clusters for customers to run various production workloads. In this talk, we share our experiences in building a real-time monitoring system for thousands of Spark nodes, including the lessons we learned and the value we’ve seen from our efforts so far.
This was part of the talk presented at the #monitorSF Meetup held at Databricks HQ in SF.
High cardinality time series search: A new level of scale - Data Day Texas 2016Eric Sammer
Modern search systems provide incredible feature sets, developer-friendly APIs, and low latency indexing and query response. By some measures, these systems operate "at scale," but rarely is that quantified. Customers of Rocana typically look to push ingest rates in excess of 1 million events per second, retaining years of data online for query, with the expectation of sub-second response times for any reasonably sized subset of data.
We quickly found that the tradeoffs made by general purpose search systems, while right for common use cases, were less appropriate for these high cardinality, large scale use cases.
This session details the architecture, tradeoffs, and interesting implementation decisions made in building a new time series optimized distributed search system using Apache Lucene, Kafka, and HDFS. Data ingestion and durability, index and metadata organization, storage, query scheduling and optimization, and failure modes will be covered. Finally, a summary of the results achieved will be shown.
First presentation for Savi's sponsorship of the Washington DC Spark Interactive. Discusses tips and lessons learned using Spark Streaming (24x7) to ingest and analyze Industrial Internet of Things (IIoT) data as part of a Lambda Architecture
Tuning Java Driver for Apache Cassandra by Nenad Bozic
Apache Cassandra is a distributed, masterless column-store database that is becoming mainstream for analytics and IoT data. Many use cases where Cassandra is a natural fit require latency tuning in order to serve requests really fast. The DataStax driver has many options, some less familiar, which can greatly influence performance. This talk will focus on Java applications and the options at your disposal in the DataStax Java driver, which has become the standard way to use this database. We will concentrate on both monitoring and tuning, and we will provide different options for different use cases. There is no silver bullet, and with so many options it takes a deep dive to figure out the right decision. This talk will narrow down the options and give you a push in the right direction.
NoSQL – Data Center Centric Application EnablementDATAVERSITY
The growth of data center infrastructure is trending out of bounds, along with the pace of user activity and data generation in this digital era. However, the nature of the typical application deployment within the data center is changing to accommodate new business needs. Those changes introduce complexities in application deployment architecture and design, which cascade into requirements for a new generation of database technology (NoSQL) destined to ease that complexity. This webcast will discuss the modern data center's data-centric application, the complexities that must be dealt with, and common architectures used to describe and prescribe new data-center-aware services. We'll look at the practical issues in implementation and give an overview of the current state of the art in NoSQL database technology for solving the problems of data center awareness in application development.
Scaling distributed data systems: A LinkedIn Case studySai Kiran Kanuri
Scaling stateless systems like web servers & application servers is pretty well understood, but scaling a stateful system has its own set of challenges. In this presentation, we hope to present our learnings & the challenges we faced while scaling a NoSQL database used within LinkedIn. Scaling a data system involves significant movement & replication of data within a cluster. This can put considerable load on a system that is already running hot, affecting the service experience.
We will look at some of the challenges & the approaches that we took.
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
We will start by understanding how Real-Time Analytics can be implemented on Enterprise Level Infrastructure, then go into the details and discover how different cases of business intelligence can be used in real time on streaming data. We will cover different Stream Data Processing Architectures and discuss their benefits and disadvantages. I'll show with live demos how to build a Fast Data Platform in Azure Cloud using open source projects: Apache Kafka, Apache Cassandra, Mesos. I'll also show examples and code from real projects.
'Kanthaka' is an attempt to bring the benefits of Big Data technologies to the telecom industry. The objective of the system is to analyze CDRs (Call Detail Records) and give results in near real time.
This was carried out as a final-year project for my B.Sc. of Engineering (Hons) degree at the University of Moratuwa, in a team with 3 more colleagues, under the supervision of a senior lecturer and an industry expert.
The presentation covers the background, the findings of the literature review and the proposed architecture of the system as it stands. Any feedback on improvements that can be made is warmly welcome!
View On-Demand http://ecast.opensystemsmedia.com/403
Repeat Success, Not Mistakes; Use DDS Best Practices to Design Your Complex Distributed Systems
RTI Connext DDS is a powerful tool that lets you efficiently build and integrate complex distributed systems like no other technology – if you use it right. Be aware of how to get the most out of DDS and how to avoid common pitfalls when developing your system. We've developed RTI Connext best practices over the course of hundreds of customer projects and many years. In this webinar, you will learn how to apply the best practices we have developed to use RTI Connext DDS in ways that will enable your system to scale effectively with optimal performance, while avoiding missteps that will cause poor performance, non-determinism and scalability problems.
At the Ottawa .NET User Group I gave a talk on Cloud Design Patterns: the External Config Pattern, Cache Aside, the Federated Identity Pattern, the Valet Key Pattern, the Gatekeeper Pattern and the Circuit Breaker Pattern. These patterns describe common problems in designing cloud-hosted applications and offer guidance on solving them.
Similar to Tuning Java Driver for Apache Cassandra by Nenad Bozic at Big Data Spain 2017 (20)
Big Data, Big Quality? by Irene Gonzálvez at Big Data Spain 2017Big Data Spain
Insights can only be as good as the data. The data quality domain is enormously large, so you need to understand your company pain points to know what to focus on first.
https://www.bigdataspain.org/2017/talk/big-data-big-quality
Scaling a backend for a big data and blockchain environment by Rafael Ríos at...Big Data Spain
2gether is a financial platform based on Blockchain, Big Data and Artificial Intelligence that allows interaction between users and third-party services in a single interface.
https://www.bigdataspain.org/2017/talk/scaling-a-backend-for-a-big-data-and-blockchain-environment
Disaster Recovery for Big Data by Carlos Izquierdo at Big Data Spain 2017Big Data Spain
All modern Big Data solutions, like Hadoop, Kafka or the rest of the ecosystem tools, are designed as distributed processes and as such include some sort of redundancy for High Availability.
https://www.bigdataspain.org/2017/talk/disaster-recovery-for-big-data
Presentation: Boost Hadoop and Spark with in-memory technologies by Akmal Cha...Big Data Spain
In this presentation, attendees will see how to speed up existing Hadoop and Spark deployments by just making Apache Ignite responsible for RAM utilization. No code modifications, no new architecture from scratch!
https://www.bigdataspain.org/2017/talk/boost-hadoop-and-spark-with-in-memory-technologies
Data science for lazy people, Automated Machine Learning by Diego Hueltes at ...Big Data Spain
The power of this new set of tools for Data Science. It is really easy to start applying these techniques in your current workflow.
https://www.bigdataspain.org/2017/talk/data-science-for-lazy-people-automated-machine-learning
Training Deep Learning Models on Multiple GPUs in the Cloud by Enrique Otero ...Big Data Spain
GPUs on the cloud as Infrastructure as a Service (IaaS) seem a commodity. However to efficiently distribute deep learning tasks on several GPUs is challenging.
https://www.bigdataspain.org/2017/talk/training-deep-learning-models-on-multiple-gpus-in-the-cloud
Unbalanced data: Same algorithms different techniques by Eric Martín at Big D...Big Data Spain
Unbalanced data is a specific data configuration that appears commonly in nature. Applying machine learning techniques to this kind of data is a difficult process, usually addressed by unbalanced reduction techniques.
https://www.bigdataspain.org/2017/talk/unbalanced-data-same-algorithms-different-techniques
State of the art time-series analysis with deep learning by Javier Ordóñez at...Big Data Spain
Time series related problems have traditionally been solved using engineered features obtained by heuristic processes.
https://www.bigdataspain.org/2017/talk/state-of-the-art-time-series-analysis-with-deep-learning
Trading at market speed with the latest Kafka features by Iñigo González at B...Big Data Spain
Not long ago only banks and hedge funds could afford doing automated and High Frequency Trading, that is, the ability to send buy commodities in microseconds intervals.
https://www.bigdataspain.org/2017/talk/trading-at-market-speed-with-the-latest-kafka-features
Unified Stream Processing at Scale with Apache Samza by Jake Maes at Big Data...Big Data Spain
The shift to stream processing at LinkedIn has accelerated over the past few years. We now have over 200 Samza applications in production processing more than 260B events per day.
https://www.bigdataspain.org/2017/talk/apache-samza-jake-maes
The Analytic Platform behind IBM’s Watson Data Platform by Luciano Resende a...Big Data Spain
IBM has built a “Data Science Experience” cloud service that exposes Notebook services at web scale.
https://www.bigdataspain.org/2017/talk/the-analytic-platform-behind-ibms-watson-data-platform
Artificial Intelligence and Data-centric businesses by Óscar Méndez at Big Da...Big Data Spain
Artificial Intelligence and Data-centric businesses.
https://www.bigdataspain.org/2017/talk/tbc
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Big Data Spain
Ten years ago there were rumours of the death of causal inference. Big data was supposed to enable us to rely on purely correlational data to predict and control the world.
https://www.bigdataspain.org/2017/talk/why-big-data-didnt-end-causal-inference
Meme Index. Analyzing fads and sensations on the Internet by Miguel Romero at...Big Data Spain
The Meme of the Internet Index will be the new normal to analyze and predict facts and sensations which go around the Internet.
https://www.bigdataspain.org/2017/talk/meme-index-analyzing-fads-and-sensations-on-the-internet
Vehicle Big Data that Drives Smart City Advancement by Mike Branch at Big Dat...Big Data Spain
Geotab is a leader in the expanding world of Internet of Things (IoT) and telematics industry with Big Data.
https://www.bigdataspain.org/2017/talk/vehicle-big-data-that-drives-smart-city-advancement
Attacking Machine Learning used in AntiVirus with Reinforcement by Rubén Mart...Big Data Spain
In recent years Machine Learning (ML) and especially Deep Learning (DL) have achieved great success in many areas such as visual recognition, NLP or even aiding in medical research.
https://www.bigdataspain.org/2017/talk/attacking-machine-learning-used-in-antivirus-with-reinforcement
More people, less banking: Blockchain by Salvador Casquero at Big Data Spain ...Big Data Spain
Primary function of banking sector is promoting economic activity; which means “commerce”, exchanging what someone produces-has for something that someone consumes-desires.
https://www.bigdataspain.org/2017/talk/more-people-less-banking-blockchain
Make the elephant fly, once again by Sourygna Luangsay at Big Data Spain 2017Big Data Spain
Bol.com has been an early Hadoop user: since 2008 where it was first built for a recommendation algorithm.
https://www.bigdataspain.org/2017/talk/make-the-elephant-fly-once-again
Feature selection for Big Data: advances and challenges by Verónica Bolón-Can...Big Data Spain
In an era of growing data complexity and volume and the advent of Big Data, feature selection has a key role to play in helping reduce high-dimensionality in machine learning problems.
https://www.bigdataspain.org/2017/talk/feature-selection-for-big-data-advances-and-challenges
8. Cassandra Overview
• partitioned data with tunable consistency
• replication factor - how many replicas
• masterless architecture
• native multi-datacenter support
14. Use Cases
• when high availability is crucial and eventual consistency is tolerable
• event sourcing
• logging continuous streams of data
• deep visitor analytics
• poor fit: early prototyping with significant query changes
• poor fit: referential integrity required
• poor fit: dynamic access patterns on data
21. Pooling options
• the driver communicates with the cluster through a pool of connections
• pool defaults changed between protocol versions V2 and V3 (core connections lowered to 1)
• allowing more requests per connection can put more load on the cluster
• add monitoring of in-flight queries on the driver side and tune for your use case
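The monitoring idea in the last bullet can be sketched without the driver. The class below is a hypothetical `InFlightGauge` (not part of the DataStax API) that counts asynchronous requests between submission and completion and remembers the peak. In the 3.x driver the comparable numbers are available per host through `Session.getState()`, so treat this only as an illustration of what to measure before changing pool sizes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper, not a driver class: tracks how many async requests are outstanding.
final class InFlightGauge {
    private final AtomicInteger inFlight = new AtomicInteger();
    private final AtomicInteger peak = new AtomicInteger();

    /** Starts the request and counts it as in flight until its future completes. */
    <T> CompletableFuture<T> track(Supplier<CompletableFuture<T>> request) {
        // Record the new in-flight count and update the peak if it was exceeded.
        peak.accumulateAndGet(inFlight.incrementAndGet(), Math::max);
        CompletableFuture<T> future;
        try {
            future = request.get();
        } catch (RuntimeException e) {
            inFlight.decrementAndGet(); // the request never started
            throw e;
        }
        // Decrement on success and on failure alike.
        return future.whenComplete((value, error) -> inFlight.decrementAndGet());
    }

    int inFlight() { return inFlight.get(); }

    int peak() { return peak.get(); }
}
```

If the observed peak stays far below the configured maximum requests per connection, raising that maximum is unlikely to help; if it sits at the limit, the pool is a candidate for tuning.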
23. Speculative executions
• spawn additional queries to other nodes after configured time
http://docs.datastax.com/en/developer/java-driver/3.1/manual/speculative_execution/
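As a rough, driver-independent sketch of the idea: the first node is queried immediately, and each further node is queried only if no response has arrived after another delay. Whichever execution answers first wins. The class and parameter names are illustrative assumptions; in the driver this behaviour is configured through a speculative execution policy (for example `ConstantSpeculativeExecutionPolicy`) and applies only to statements marked idempotent.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Illustrative sketch only; each Supplier stands in for "run the query on one node".
final class SpeculativeExecutionSketch {

    /**
     * Runs nodes.get(0) now and nodes.get(i) after i * delayMillis, unless a response
     * has already arrived. The scheduler needs at least nodes.size() threads so that
     * a slow execution does not block the speculative ones.
     */
    static <T> CompletableFuture<T> execute(List<Supplier<T>> nodes,
                                            long delayMillis,
                                            ScheduledExecutorService scheduler) {
        CompletableFuture<T> result = new CompletableFuture<>();
        for (int i = 0; i < nodes.size(); i++) {
            Supplier<T> node = nodes.get(i);
            scheduler.schedule(() -> {
                if (!result.isDone()) {
                    // complete() is a no-op if another execution already answered.
                    result.complete(node.get());
                }
            }, i * delayMillis, TimeUnit.MILLISECONDS);
        }
        return result;
    }
}
```

The sketch ignores failures of individual executions. It also shows the cost: every speculative execution is extra work for the cluster, which is why the delay and the maximum number of executions need tuning.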
25. Timeouts
• driver read timeout vs server read timeout
• driver settings for all queries or per query settings
• setReadTimeoutMillis and setConnectTimeoutMillis (on SocketOptions)
26. Retry policies
• fail early and retry
• add retry policy or speculative execution
• downgrading retry policy if getting possibly inconsistent data is preferable to getting no data
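"Fail early and retry" combines the two previous slides: a short client-side timeout per attempt, plus a bounded number of attempts. The following is a minimal sketch of that combination using only the JDK (Java 9+ for `orTimeout`). `callWithRetry` and its parameters are hypothetical names, and the driver's own retry policies make finer-grained decisions than this.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

// Illustrative sketch only; the Supplier stands in for "send the query asynchronously".
final class FailEarlyRetrySketch {

    /** Gives each attempt timeoutMillis to answer and retries timed-out attempts. */
    static <T> T callWithRetry(Supplier<CompletableFuture<T>> query,
                               long timeoutMillis,
                               int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        CompletionException lastTimeout = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.get().orTimeout(timeoutMillis, TimeUnit.MILLISECONDS).join();
            } catch (CompletionException e) {
                if (!(e.getCause() instanceof TimeoutException)) {
                    throw e; // only timeouts are retried in this sketch
                }
                lastTimeout = e;
            }
        }
        throw lastTimeout; // every attempt timed out
    }
}
```

Retrying a timed-out write is only safe when the operation is idempotent, since the first attempt may still have been applied on the server.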
28. Click stream and IoT measurements
• visualize measurements from many devices
• fast access with tolerable inconsistencies
• DC-aware and token-aware policies to land on a local node that holds the data
• lower consistency level (ONE) or use downgrading retry policy
• use speculative executions to query more nodes if the cluster can manage the load
29. Mission critical data with tolerable performance
• stock data in warehouse used to compare with ERP system
• high consistency (read replicas + write replicas > replication factor)
• retry and reconnect policies are a must
• choose a lower number of requests per connection so as not to overload the cluster
• set a lower read timeout to fail early and retry
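The consistency rule on this slide is simple arithmetic: a read is guaranteed to overlap the latest acknowledged write when the number of replicas read plus the number of replicas written exceeds the replication factor. A small sketch with hypothetical helper names:

```java
// Illustrative helpers for the "read + write > replication factor" rule.
final class ConsistencyMath {

    /** Number of replicas in a quorum: a strict majority of the replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** True when every read set must intersect every write set. */
    static boolean readsSeeLatestWrite(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }
}
```

With a replication factor of 3, QUORUM reads and QUORUM writes touch 2 replicas each (2 + 2 > 3), so they overlap. Reading and writing at ONE (1 + 1 = 2) does not satisfy the rule, which is the trade-off accepted in the click stream use case.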
30. Write-heavy, low-latency read use case
• ad serving (store user analytics and serve ads fast)
• separate reads and writes so that each can be tuned differently
• latency-aware policy on reads to always choose fast-performing nodes
• lower the read timeout on both driver and server to fail early
• increase maximum requests per connection
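The latency-aware idea can be illustrated in a few lines: keep an average latency per node and route reads only to nodes that are not much slower than the fastest one. The sketch below assumes the averages are already measured and uses hypothetical names. The driver's `LatencyAwarePolicy` works along similar lines, with additional machinery such as score decay and retry periods that is left out here.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrative sketch only; node names and latencies are supplied by the caller.
final class LatencyAwareSketch {

    /**
     * Returns the nodes whose average latency is within exclusionThreshold times
     * the fastest node's average, sorted by name for a stable result.
     */
    static List<String> eligibleNodes(Map<String, Double> avgLatencyMillis,
                                      double exclusionThreshold) {
        double best = avgLatencyMillis.values().stream()
                .mapToDouble(Double::doubleValue)
                .min()
                .orElse(Double.MAX_VALUE);
        return avgLatencyMillis.entrySet().stream()
                .filter(entry -> entry.getValue() <= best * exclusionThreshold)
                .map(Map.Entry::getKey)
                .sorted()
                .collect(Collectors.toList());
    }
}
```

For example, with averages of 2 ms, 3.5 ms and 10 ms and a threshold of 2.0, the cut-off is 4 ms, so only the first two nodes remain eligible for reads.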
32. Conclusion and takeaways
• know your use case and know your database
• each tuning option requires good monitoring
• iterate: test, measure, adjust
33. Links
• SmartCat Blog post - Tuning Java driver for Apache Cassandra - part 1
• SmartCat Blog post - Tuning Java driver for Apache Cassandra - part 2
• Use case example - Tuning for heavy write and low latency read scenario