This presentation will explore how we added location data to a scalable real-time anomaly detection application built around Apache Kafka and Apache Cassandra. Kafka and Cassandra are designed for time-series data; however, it’s not so obvious how they can efficiently process spatiotemporal data (space and time). In order to find location-specific anomalies, we need ways to represent, index, and query locations. We explore alternative geospatial representations, including latitude/longitude points, bounding boxes and geohashes, and go vertical with 3D representations, including 3D geohashes. For each representation we also explore possible Cassandra implementations, including clustering columns, secondary indexes, denormalized tables, and the Cassandra Lucene Index Plugin. To conclude, we measure and compare the query throughput of some of the solutions, and summarise the results in terms of accuracy vs. performance to answer the question “Which geospatial data representation and Cassandra implementation is best?”
ApacheCon NA 2020 Geospatial track presentation https://www.apachecon.com/acah2020/tracks/geospatial.html
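To make the geohash representation concrete before diving in: a geohash encodes a latitude/longitude pair as a short base32 string in which nearby points share a common prefix, which is what makes it a natural fit for Cassandra clustering columns and range queries. Below is a minimal sketch of the encoding in pure Python; it is illustrative only, not code from the talk, and the coordinates are arbitrary.

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=6):
    """Interleave longitude/latitude bisection bits, 5 bits per character."""
    lat_range, lon_range = [-90.0, 90.0], [-180.0, 180.0]
    chars, bits, bit_count, even = [], 0, 0, True  # even-numbered bits encode longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if value >= mid:
            bits |= 1
            rng[0] = mid
        else:
            rng[1] = mid
        even = not even
        bit_count += 1
        if bit_count == 5:  # 5 bits make one base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)

print(geohash_encode(-35.28, 149.13))  # Canberra; longer hashes mean smaller cells

Each additional character shrinks the cell, so truncating a geohash gives a cheap, coarse bounding box: that prefix property is what makes clustering-column and denormalized-table implementations of location queries possible.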
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
Geospatial data makes it possible to leverage location, location, location! Geospatial data is taking off, as companies realize that just about everyone needs the benefits of geospatially aware applications. As a result, there is no shortage of unique but demanding use cases of how enterprises are leveraging large-scale and fast geospatial big data processing. The data must be processed in large quantities - and quickly - to reveal hidden spatiotemporal insights vital to businesses and their end users. In the rush to tap into geospatial data, many enterprises will find that representing, indexing and querying geospatially-enriched data is more complex than they anticipated - and might bring about tradeoffs between accuracy, latency, and throughput.
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
Updated version of presentation for 30 April 2020 Melbourne Distributed Meetup (online)
Massively Scalable Real-time Geospatial Data Processing with Apache Kafka and... - Paul Brebner
This is a slightly shorter version of previous ones.
Google Cloud Special Edition, Sydney Data Engineering Meetup
https://www.meetup.com/Sydney-Data-Engineering-Meetup/events/269146076/
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su... - DataStax
Deleting data from Cassandra has several challenges, and existing solutions (tombstones or TTLs) have limitations that make them unusable or untenable in certain circumstances. We'll explore the cases where existing deletion options fail or are inadequate, then describe a solution we developed which deletes data from Cassandra during standard or user-defined compaction, but without resorting to tombstones or TTLs.
About the Speaker
Eric Stevens Principal Architect, ProtectWise, Inc.
Eric is the principal architect and a day-one employee of ProtectWise, Inc., specializing in massive real-time processing and scalability problems. The team at ProtectWise processes, analyzes, optimizes, indexes, and stores billions of network packets each second. They look for threats in real time, but also store full-fidelity network data (including PCAP), and when new security intelligence is received, automatically replay existing network history through that new intelligence.
What is in All of Those SSTable Files Not Just the Data One but All the Rest ... - DataStax
Have you ever wondered what is in all of those SSTable files and how they help Cassandra find and manage your data? If you go to the DataStax website they will give you a high-level explanation of what is in each file. In this talk we will go much deeper, explaining each file and walking through a dump of its contents. We will also explore the differences between Cassandra 2.1 and 3.4.
About the Speaker
John Schulz Principal Consultant, The Pythian Group
John has 40 years of experience working with data: data in files and in databases, from flat files through ISAM to relational databases and, most recently, NoSQL. For the last 15 years he's worked on a variety of open source technologies including MySQL, PostgreSQL, Cassandra, Riak, Hadoop and HBase. He has been working with Cassandra since 2010. For the last eighteen months he has been working for The Pythian Group, helping their customers improve their existing databases and select new ones.
Discussion about the evolution of metrics in Cassandra from 1.0 to 3.0: how the metric changes impact operational tooling, the pros and cons of different metric representations, and how and why DataStax OpsCenter collects and stores metrics. Includes a deep dive into how DataStax OpsCenter represents and stores the different kinds of metrics to provide visibility beyond simple cluster averages, both behind the scenes and in the rendering.
About the Speaker
Chris Lohfink Software Engineer, DataStax
I am a Java, Python, and Clojure developer who has been using Cassandra in application development and operational contexts for the last five years. For nearly two years I have been working with the OpsCenter Monitoring team at DataStax to improve the accuracy and breadth of the visualization tooling available.
What We Learned About Cassandra While Building go90 (Christopher Webster & Th... - DataStax
Go90 is a mobile entertainment platform offering access to live and on-demand videos. We built the web services platform and social features like the activity feed for go90 by making heavy use of Cassandra and Scala, and would like to share what we learned while developing and operating go90. In this presentation, we cover our data model evolution from the initial prototypes to the current production version, and the significant performance gain from a better data model. We will explain how we apply time-series data modeling and the benefits of using expiring columns with DateTieredCompactionStrategy. We will also talk about interesting experiences with table modifications, tombstones and table pagination. On the operations side, we will discuss our findings on Java driver usage, performance, monitoring, cluster maintenance, version upgrades, two-way SSL and more. We hope you can learn from our mistakes instead of making them yourself!
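For readers unfamiliar with the expiring-columns pattern mentioned above, here is a hedged sketch (my own illustration, not go90's actual schema) of a time-series table with a default TTL and DateTieredCompactionStrategy, using the Python driver:

from cassandra.cluster import Cluster  # pip install cassandra-driver

session = Cluster(["127.0.0.1"]).connect("demo")  # assumes a local test cluster
session.execute("""
    CREATE TABLE IF NOT EXISTS activity_feed (
        user_id    text,
        event_time timestamp,
        event      text,
        PRIMARY KEY ((user_id), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
      AND default_time_to_live = 2592000   -- rows expire after 30 days
      AND compaction = {'class': 'DateTieredCompactionStrategy'}
""")

Because whole SSTables age out together under DTCS, expired time-series data can be dropped without the tombstone-heavy compactions that per-row deletes cause. (Later Cassandra versions favour TimeWindowCompactionStrategy for the same job.)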
About the Speakers
Christopher Webster Software Engineer, AOL
Christopher Webster works on the web services platform for the go90 AOL project. Previously he was a Computer Scientist for the Mission Control Technologies project at NASA Ames Research Center. Chris worked as a senior staff engineer at Sun Microsystems on Project zembly, the cloud development and deployment environment, as well as being technical lead on many NetBeans projects. Chris is an author of the NetBeans Field Guide and Assemble the Social Web With Zembly.
Thomas Ng Software Engineer, AOL
Thomas Ng is a software engineer at AOL, building web services for the go90 mobile entertainment platform using Cassandra, Scala and Kafka.
Cassandra is the dominant data store used at Netflix and its health is critical to many of its services. In this talk we will share details of the recent redesign of our health monitoring system and how we leveraged a reactive stream processing system to give us a real-time view of our entire fleet while dramatically improving accuracy and reducing false alarms in our alerting.
About the Speaker
Jason Cacciatore Senior Software Engineer, Netflix
Jason Cacciatore is a Senior Software Engineer at Netflix, where he's been working for the past several years. He's interested in stateful distributed systems and has a diverse background in technology. In his spare time he enjoys spending time with his wife and two sons, reading non-fiction, and watching Netflix documentaries.
Scalable Data Modeling by Example (Carlos Alonso, Job and Talent) | Cassandra... - DataStax
Cassandra is getting more and more buzz, and that means two things: more development and more issues. Some issues are unavoidable, but some of them are avoidable just by understanding how our tooling works.
In this talk I'd like to review the core concepts on which Cassandra is built and how they dictate the way we should work with it, using some examples that will hopefully give you both a 'Quick Reference' and a 'Checklist' to go through every time you want to build scalable data models.
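As a flavour of that query-first approach (my own minimal example, not from Carlos's slides): choose the partition key from the query you need to serve, and let clustering columns return sorted, single-partition results.

import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("demo")  # assumes a local test cluster
session.execute("""
    CREATE TABLE IF NOT EXISTS readings_by_sensor_day (
        sensor_id text,
        day       date,
        ts        timestamp,
        value     double,
        PRIMARY KEY ((sensor_id, day), ts)   -- one bounded partition per sensor per day
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
# The query the table was designed for: hits one partition, already sorted.
rows = session.execute(
    "SELECT ts, value FROM readings_by_sensor_day "
    "WHERE sensor_id = %s AND day = %s LIMIT 100",
    ("sensor-42", datetime.date(2016, 9, 1)))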
About the Speaker
Carlos Alonso Software Engineer, Job and Talent
Carlos received his Masters in CS at Salamanca University, Spain. He worked there for a few years in a digital agency, gaining expertise across a very wide range of technologies, before moving to London, where he narrowed his focus to the backend and data engineering disciplines. The latest step in his professional career was moving back to Madrid to work for Job and Talent, where he currently helps build the best candidate-job-opening matching technology. Aside from work he likes sharing as much as he can through public speaking, mentoring, and getting involved in OSS and OpenData initiatives.
A Cassandra + Solr + Spark Love Triangle Using DataStax Enterprise - Patrick McFadin
Wait! Back away from the Cassandra secondary index. It's OK for some use cases, but it's not an easy button. "But I need to search through a bunch of columns to look for the data and I want to do some regression analysis… and I can't model that in C*, even after watching all of Patrick McFadin's videos. What do I do?" The answer, dear developer, is in DSE Search and Analytics. With its easy Solr API and Spark integration you can search and analyze data stored in your Cassandra database to your heart's content. Take our hand. We will show you how.
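For context, DSE Search exposes Solr through CQL via the special solr_query pseudo-column, so a search looks something like the hedged sketch below (table and fields are invented for illustration):

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("store")  # assumes a DSE Search-enabled node
rows = session.execute(
    """SELECT * FROM products
       WHERE solr_query = '{"q": "description:waterproof", "fq": "price:[10 TO 100]"}'""")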
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*... - DataStax
At Knewton we operate a total of 29 clusters across five different VPCs, each ranging from 3 to 24 nodes. For a team of three to maintain, this is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open-sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically, I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a Cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered, and I will walk through examples of using these commands to identify and remediate cluster-wide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Databases Have Forgotten About Single Node Performance, A Wrongheaded Trade Off - Timescale
The earliest relational databases were monolithic on-premises systems that were powerful and full-featured. Fast forward to the Internet and NoSQL: BigTable, DynamoDB and Cassandra. These distributed systems were built to scale out for ballooning user bases and operations. As more and more companies vied to be the next Google, Amazon, or Facebook, they too "required" horizontal scalability.
But in a real way, NoSQL and even NewSQL have forgotten single node performance where scaling out isn't an option. And single node performance is important because it allows you to do more with much less. With a smaller footprint and simpler stack, overhead decreases and your application can still scale.
In this talk, we describe TimescaleDB's methods for single node performance. The nature of time-series workloads and how data is partitioned allows users to elastically scale up even on single machines, which provides operational ease and architectural simplicity, especially in cloud environments.
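The core of that single-node approach is a one-line conversion of an ordinary PostgreSQL table into a chunked "hypertable". A minimal sketch via psycopg2 (connection details and table are placeholders; assumes the TimescaleDB extension is installed):

import psycopg2

conn = psycopg2.connect("dbname=tsdb user=postgres")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS metrics (
            time   TIMESTAMPTZ NOT NULL,
            device TEXT,
            cpu    DOUBLE PRECISION
        )""")
    # Partition the table into time-based chunks; existing queries are unchanged.
    cur.execute("SELECT create_hypertable('metrics', 'time', if_not_exists => TRUE)")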
Cassandra Community Webinar | Getting Started with Apache Cassandra with Patr... - DataStax Academy
Video: http://youtu.be/B-bTPSwhsDY
Abstract
Patrick McFadin (@PatrickMcFadin), Chief Evangelist for Apache Cassandra at DataStax, will be presenting an introduction to Cassandra as a key player in database technologies. Large and small companies alike have chosen Apache Cassandra as their database solution, and Patrick will be presenting on why they made that choice.
Patrick will also be discussing Cassandra's architecture, including: data modeling, time-series storage and replication strategies, providing a holistic overview of how Cassandra works and the best way to get started.
About Patrick McFadin
Prior to working for DataStax, Patrick was the Chief Architect at Hobsons, an education services company. His responsibilities included ensuring product availability and scaling for all higher education products. Prior to this position, he was the Director of Engineering at Hobsons, which he came to after they acquired his company, Link-11 Systems, a software services company. While at Link-11 Systems, he built the first widely popular CRM system for universities, Connect. He obtained a BS in Computer Engineering from Cal Poly, San Luis Obispo and holds the distinction of being the only recipient of a medal (as far as anyone can find out) for hacking while serving in the US Navy.
A detailed design for a robust counter, as well as a design for a completely online multi-armed bandit implementation that uses the new Bayesian Bandit algorithm.
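The Bayesian Bandit algorithm referred to here is usually realised as Thompson sampling; a minimal online sketch (mine, not the slide deck's design) keeps a Beta posterior per arm, samples from each, plays the winner, and updates:

import random

class BayesianBandit:
    def __init__(self, n_arms):
        self.wins   = [1] * n_arms   # Beta(1, 1) uniform priors
        self.losses = [1] * n_arms

    def choose(self):
        # Sample a plausible success rate per arm, play the best sample.
        samples = [random.betavariate(w, l)
                   for w, l in zip(self.wins, self.losses)]
        return samples.index(max(samples))

    def update(self, arm, reward):
        if reward:
            self.wins[arm] += 1
        else:
            self.losses[arm] += 1

bandit = BayesianBandit(3)
hidden_rates = [0.1, 0.5, 0.3]              # unknown to the bandit
for _ in range(1000):
    arm = bandit.choose()
    bandit.update(arm, random.random() < hidden_rates[arm])
print(bandit.wins)  # arm 1 should have attracted most of the plays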
Cassandra Community Webinar | In Case of Emergency Break Glass - DataStax
The design of Apache Cassandra allows applications to provide constant uptime. Peer-to-peer technology ensures there are no single points of failure, and the consistency guarantees allow applications to function correctly while some nodes are down. There is also a wealth of information provided by the JMX API and the system log. All of this means that when things go wrong you have the time, information and platform to resolve them without downtime. This presentation will cover some of the common, and not so common, performance issues, failures and management tasks observed in running clusters. Aaron will discuss how to gather information and how to act on it. Operators, developers and managers will all benefit from this exposition of Cassandra in the wild.
Aggregation is ubiquitous and data is no exception. This slide deck presents the data aggregation concept and The HDF Group's approach to the data aggregation problem in Earth Science. A JPSS data aggregation tool called "nagg" is explained as a showcase example.
Slides from my talk at Cassandra Summit 2016 on troubleshooting Cassandra. This is a reprise of my popular talk from last summit, reorganized, expanded, and updated for Cassandra 3.0. In it I share the secrets I've learned in four years of supporting hundreds of customers using Apache Cassandra and DataStax Enterprise. Be sure to check out presenter notes for additional tips and links to further resources.
InfiniCortex and the Renaissance in Polish Supercomputing - inside-BigData.com
In this deck from the DDN User Group at SC16, Marek Michalewicz, Deputy Director, ICM, University of Warsaw, presents: From Singapore to Warsaw: Infinicortex and the Renaissance in Polish Supercomputing.
"Over the past two years, InfiniCortex has clearly demonstrated that IB can perform over trans-continental distances, exploiting this technology to create a “Galaxy of Supercomputers” (a term coined by Marek Michalewicz and Yuefan Deng whose research focus is on mathematically optimal network topologies for supercomputers), a worldwide IB network spanning sites across Asia, Europe and North America. Initiated and led by A*STAR CRC (Agency for Science, Technology and Research - Computational Resource Centre in Singapore), the project hit its first major breakthrough at SuperComputing14 (SC14), showcasing a first-time-ever 100G IB transcontinental connection from Singapore to the SC14 venue in New Orleans (USA). This was made possible primarily thanks to the support of TATA Telecommunications, which provided the 100G trans-pacific link from Singapore to the US, and Obisidian Strategics, the Canadian manufacturer of the IB long-range equipment, that made available a number of units to be deployed in the participating sites."
Watch the video presentation: http://wp.me/p3RLHQ-g5g
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Introduction to the Oakforest-PACS Supercomputer in Japan - inside-BigData.com
In this deck from the DDN User Group at SC16, Prof. Taisuke Boku from the University of Tsukuba & JCAHPC presents: Oakforest-PACS: Overview of the New JCAHPC Computing Facility.
The University of Tokyo, the University of Tsukuba, and Fujitsu Limited recently announced that the Oakforest-PACS massively parallel cluster-type supercomputer, built by Fujitsu and operated by the Joint Center for Advanced High Performance Computing (JCAHPC), has achieved a LINPACK performance result of 13.55 petaflops, as ranked in the November Top500 list for supercomputer performance. Given this, Oakforest-PACS has surpassed the K computer to officially become the highest performance supercomputer in Japan. The system's peak performance is 25 petaflops, which is about 2.2 times that of the K computer.
"Thanks to DDN’s IME Burst Buffer, researchers using Oakforest-PACS at the Joint Center for Advanced High Performance Computing (JCAHPC) are able to improve modeling of fundamental physical systems and advance understanding of requirements for Exascale-level systems architectures. With DDN’s advanced technology, JCAHPC has achieved effective I/O performance exceeding 1TB/s in writing tens of thousands of processes to the same file."
Watch the video presentation: http://wp.me/p3RLHQ-g3D
Learn more: http://ddn.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Learn how to achieve scale with MongoDB. In this presentation, we cover three different ways to scale MongoDB, including optimization, vertical scaling, and horizontal scaling.
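As a taste of the horizontal option, here is a hedged pymongo sketch of enabling sharding and distributing a collection by a hashed key (database and collection names are placeholders; the client must point at a mongos router):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # a mongos, not a bare mongod
client.admin.command("enableSharding", "appdb")
client.admin.command("shardCollection", "appdb.events",
                     key={"user_id": "hashed"})     # hashed key spreads writes evenly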
Cassandra gives operations a lot of control over the system by forcing them to make a lot of decisions they'd rather not make around cluster topology changes. Hecuba2 is a tool that helps to automate that. Hecuba2 has a library component and an agent component. The library provides an API for manipulating Cassandra topologies, and the agent runs on all Cassandra hosts and converges the existing topology to the generated topology.
Hecuba2 is running in production at Spotify and has been remarkably bug free since being rolled out. It supports creating a cluster, expanding a cluster, and replacing nodes.
This talk will cover the design of Hecuba2 and how to deploy it.
About the Speaker
Radovan Zvoncek Backend Engineer, Spotify
After graduating with a master's degree in distributed systems, I joined Spotify as a backend engineer. For the past three years I've been involved in Cassandra operations, as well as the cultivation of the Cassandra ecosystem at Spotify.
C* for Deep Learning (Andrew Jefferson, Tractable) | Cassandra Summit 2016 - DataStax
A deep learning startup has a requirement for a robust and scalable data architecture. Training a Deep Neural Network requires 10s-100s of millions of examples consisting of data and metadata. In addition to training it is necessary to support test/validation, data exploration and more traditional data science analytics workloads. As a startup we have minimal resources and an engineering team of 1.
Cassandra, Spark and Kafka running on Mesos in AWS is a scalable architecture that is fast and easy to set up and maintain to deliver a data architecture for Deep Learning.
About the Speaker
Andrew Jefferson VP Engineering, Tractable
A software engineer specialising in realtime data systems. I've worked at companies from Startups to Apple on applications ranging from Ticketing to Genetics. Currently building data systems for training and exploiting Deep Neural Networks.
RasterFrames: Enabling Global-Scale Geospatial Machine Learning - Astraea, Inc.
RasterFrames™, a proposed LocationTech project, brings the power of Spark SQL and Spark ML to the analysis of global-scale geospatial-temporal raster data. Employing the rich geospatial primitives of LocationTech GeoTrellis and GeoMesa, RasterFrames provides scientists, data scientists and software developers with a unified data and compute model for building image processing pipelines for ETL, data-product creation, statistical analysis, supervised & unsupervised machine learning, and deep learning. Data scientists particularly benefit from the DataFrame-centric entrypoint into big data geospatial analytics.
This talk will introduce RasterFrames, explaining the need it fulfills, the capabilities it provides, and context for determining if RasterFrames is right for the problems you're trying to solve.
By Simeon Fitch
InfluxDB IOx Tech Talks: A Rusty Introduction to Apache Arrow and How it App... - InfluxData
InfluxDB IOx Tech Talks - December 2020
A Rusty Introduction to Apache Arrow and How it Applies to a Time Series Database
This session will start with a tech talk from an InfluxDB IOx team member. This is your chance to interact directly with Influxers who are available to answer your questions about all things InfluxDB IOx and time series — including Paul Dix, Founder and CTO of InfluxData. This event will last about an hour and there will be time for live Q&A.
3D visualization today has ever-expanding applications in science, education, engineering, medicine, interactive multimedia like games, etc. Producers of graphics processing units (GPUs), specialized electronic circuits designed to rapidly manipulate and alter computer memory so as to massively accelerate the visualization of 3D environments, bring ever-faster products to the market every six months, which is rapidly increasing the possibilities of near-future visualization/simulation methods.
Cracking the Nut, Solving Edge AI with Apache Tools and Frameworks - Timothy Spann
27-April-2021. Developer Week Europe. OPEN Stage A. 11:00
Using Apache Flink, Apache Airflow, Apache Arrow, Apache NiFi, Apache Kafka, Apache MXNet, DJL.AI, Apache Tika, Apache OpenNLP, Apache Kudu, Apache Impala, Apache HBase and more open source tools for edge AI.
Database@Home - Maps and Spatial Analyses: How to Use Them - Tammy Bednar
The converged Oracle Database lets you store, manage, and query location information. Built-in SQL, REST, Java, and JavaScript APIs with JSON let you analyze addresses, GPS coordinates, coverage areas, territories and other geographic information. The first half of the session will describe the rich set of spatial analysis and spatial application development features of the database. The second half will be a case study of how these features are used in a Marketing Analytics platform that helps identify prospective customers and provide a compelling customer experience.
This is a slide deck that I have been using to present on GeoTrellis at various meetings and workshops. The information speaks to the GeoTrellis pre-1.0 release in Q4 of 2016.
Storage, Manipulation and Analysis of Raster Data with PostGIS Raster - ACSG Section Montréal
One of the most important new features of the open source spatial database PostgreSQL/PostGIS 2.0 is support for raster data. PostGIS Raster includes an import tool similar to shp2pgsql, based on GDAL, and a series of SQL operators for manipulating and analyzing raster data. The new RASTER type is georeferenced, multi-resolution and multi-band, and it supports a nodata value and a per-band pixel type. PostGIS Raster draws on the simplicity of the vector experience offered by PostGIS to make all raster operations as simple as possible. As with a vector coverage, a raster coverage is divided into a set of records (one row = one tile) stored in a single table (unlike Oracle Spatial, which uses two types and therefore two or more tables). A complete coverage can be imported and retiled in a single command with the import tool, and multiple resolutions of the same coverage can be imported into adjacent tables. The properties of raster objects and of each band can be inspected and modified, as can the pixel values. Functions exist to obtain the minimum, maximum, sum, mean, standard deviation and histogram of a tile or of a complete coverage. The ST_Intersection() and ST_Intersects() functions work almost transparently between raster and vector data, and a series of map algebra functions (ST_MapAlgebra()) enables raster-style analysis. Bands can be reclassified and converted to any GDAL-writable format. Functions for generating rasters and bands also exist for PL/pgSQL development. A GDAL driver for converting raster coverages to image files is under development, and plugins for QGIS and gvSIG already exist for viewing them.
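To make the raster/vector interplay concrete, here is a small hedged sketch querying a raster coverage at a point from Python (table name and coordinates are invented; ST_Value and ST_Intersects are standard PostGIS raster functions):

import psycopg2

conn = psycopg2.connect("dbname=gis")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT ST_Value(rast, 1, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
        FROM elevation   -- one row per tile
        WHERE ST_Intersects(rast, ST_SetSRID(ST_MakePoint(%s, %s), 4326))
    """, (-73.6, 45.5, -73.6, 45.5))
    print(cur.fetchone())   # pixel value of band 1 at the point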
I gave this talk at Buzzwords just now to fill in for an ill speaker.
The topics include things that are being added to or taken out of Mahout. These include cruft (out), fast clustering (in), nearest neighbor search (in), Pig bindings for Mahout (who knows).
See conference video - http://www.lucidimagination.com/devzone/events/conferences/ApacheLuceneEurocon2011
Faceted searching is a must-have feature for enhancing findability and user engagement in an enterprise search UI. The faceted searching features of Apache Solr have been a major factor in its popularity, but many Solr users don't fully appreciate all of the capabilities that are available. In this session we will deep dive into the different types of data facets that Solr supports, discussing in detail the various options that can be used to explore them. We will also review some specific techniques for dealing with several complex use cases, and discuss some performance "gotchas" and how to avoid them.
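For readers new to it, field faceting is driven by a handful of request parameters; a minimal sketch against Solr's HTTP API (collection and field names are placeholders):

import requests

resp = requests.get("http://localhost:8983/solr/products/select", params={
    "q": "*:*",
    "rows": 0,                 # counts only, no documents
    "facet": "true",
    "facet.field": "category",
    "facet.limit": 10,
})
print(resp.json()["facet_counts"]["facet_fields"]["category"])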
Crunching Data with Google BigQuery. Jordan Tigani at Big Data Spain 2012 - Big Data Spain
Session presented at Big Data Spain 2012 Conference
16th Nov 2012
ETSI Telecomunicacion UPM Madrid
www.bigdataspain.org
More info: http://www.bigdataspain.org/es-2012/conference/crunching-data-with-google-bigquery/jordan-tigani
Data Wars: The Bloody Enterprise Strikes Back - Victor_Cr
I would like to describe cases where we create problems for "future us" just by accident. I will show how different Java data types can ease or increase the pain of supporting the application later, covering the most common pitfalls and tricky corner cases you have probably never thought about.
This talk will show what is possible with the huge datasets that are becoming more prevalent in the era of big data. I will demonstrate this, and the 3D visualization, in the Jupyter notebook, by now the almost-standard environment of (data) scientists.
With large astronomical catalogues containing more than a billion stars becoming common, we are preparing methods to visualize and explore these large datasets. Data volumes of this size require different visualization techniques, since scatter plots become too slow and meaningless due to overplotting. We solve the performance and visualization issue using binned statistics, e.g. histograms, density maps, and volume rendering in 3D. The calculation of statistics on N-dimensional grids is handled by a Python library called vaex, which I will introduce. It can process at least a billion samples per second, to produce, for instance, the mean of a quantity on a regular grid. These statistics can be calculated for any mathematical expression on the data (numpy style), on the full dataset or on subsets specified by queries/selections.
However, to visualize higher-dimensional data in the notebook interactively, no proper solution existed. This led to the development of ipyvolume, which can render 3D volumes and up to a million glyphs (scatter and quiver plots) in the Jupyter notebook as a widget. With the browser as a platform, and the release of ipywidgets 6.0, these 3D plots can also be embedded in static HTML files and render on nbviewer. This allows for sharing with colleagues, rendering on your tablet (paperless office), outreach, press release material, etc. Full-screen stereo rendering allows for a virtual reality experience using your phone and Google Cardboard, a minor investment compared to other VR head-mounted displays. Overlaying 3D quiver plots on a 3D volume rendering allows exploring a 6D (or higher) space.
Vaex and ipyvolume can be used together to explore and visualize any large tabular dataset, or separately to calculate statistics and render 3D plots in the notebook and outside it.
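A brief sketch of that workflow (APIs as of the vaex/ipyvolume versions current at the time; treat this as indicative rather than definitive):

import vaex
import ipyvolume as ipv

df = vaex.example()                       # bundled simulated-galaxy sample dataset
# Binned statistic on a regular grid: mean vz on a 64x64 x-y grid.
grid = df.mean(df.vz, binby=[df.x, df.y], shape=64)
# 3D scatter of the first 100k rows, rendered as a notebook widget.
x, y, z = (df.evaluate(c, i1=0, i2=100_000) for c in ("x", "y", "z"))
ipv.quickscatter(x, y, z, size=0.2)
ipv.show()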
Similar to Massively Scalable Real-time Geospatial Anomaly Detection with Apache Kafka and Cassandra (ApacheCon 2020)
The Impact of Hardware and Software Version Changes on Apache Kafka Performan... - Paul Brebner
Apache Kafka's performance and scalability can be impacted by both hardware and software dimensions. In this presentation, we explore two recent experiences from running a managed Kafka service.
The first example recounts our experiences with running Kafka on AWS's Graviton2 (ARM) instances. We performed extensive benchmarking but didn't initially see the expected performance benefits. We developed multiple hypotheses to explain the unrealized performance improvement, but we could not experimentally determine the cause. We then profiled the Kafka application, and after identifying and confirming a likely cause, we found a workaround and obtained the hoped-for improved price/performance.
The second example explores the ability of Kafka to scale with increasing partitions. We revisit our previous benchmarking experiments with the newest version of Kafka (3.X), which has the option to replace Zookeeper with the new KRaft protocol. We test the theory that Kafka with KRaft can 'scale to millions of partitions' and also provide valuable experimental feedback on how close KRaft is to being production-ready.
Presentation for the ApacheCon NA Performance Engineering Track, October 6, 2022, Sheraton Hotel, New Orleans.
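A hedged sketch of the kind of partition-scaling experiment described above: create benchmark topics with increasing partition counts via the admin API, then drive producers against each (kafka-python assumed; broker address and counts are placeholders):

from kafka.admin import KafkaAdminClient, NewTopic  # pip install kafka-python

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
admin.create_topics(new_topics=[
    NewTopic(name=f"bench-{n}", num_partitions=n, replication_factor=3)
    for n in (10, 100, 1_000, 10_000)
])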
Spinning your Drones with Cadence Workflows and Apache Kafka - Paul Brebner
The rapid rise in Big Data use cases over the last decade has been accelerated by popular massively scalable open-source technologies such as Apache Cassandra® for storage, Apache Kafka® for streaming, and OpenSearch® for search. Now there’s a new member of the peloton, Cadence, for orchestration - code-based scalable fault-tolerant workflow orchestration. To illustrate the most important Cadence concepts (and more) we’ll build a realistic drone delivery service demonstration application. We’ll also explore what happens when orchestration meets choreography, and use the drone application to illustrate different ways to integrate Cadence with Apache Kafka, including reusing Kafka microservices. But how scalable is Cadence in practice? We’ll fill the sky with drones - how many drones can we get flying at once?
Change Data Capture (CDC) With Kafka Connect® and the Debezium PostgreSQL Sou...Paul Brebner
Modern event-based/streaming distributed systems embrace the idea that change is inevitable and actually desirable! Without being change-aware, systems are inflexible, can’t evolve or react, and are simply incapable of keeping up with real-time real-world data. But how can we speed up an “Elephant” (PostgreSQL) to be as fast as a “Cheetah” (Kafka)? In this talk, we'll introduce the Debezium PostgreSQL Connector, and explain how to deploy, configure and run it on a Kafka Connect cluster, explore the semantics and format of the change data events (including Schemas and Table/Topic mapping), and test the performance. Finally, we'll show how to stream the change data events into an example downstream system, Elasticsearch, using an open source sink connector.
Presentation for PostgresConf.CN and PGConf.Asia 2021 https://www.highgo.ca/2022/01/19/2021-pg-asia-conference-delivered-another-successful-online-conference-again/
Scaling Open Source Big Data Cloud Applications is Easy/HardPaul Brebner
In the last decade, the development of modern horizontally scalable open-source Big Data technologies such as Apache Cassandra (for data storage), and Apache Kafka (for data streaming) enabled cost-effective, highly scalable, reliable, low-latency applications, and made these technologies increasingly ubiquitous. To enable reliable horizontal scalability, both Cassandra and Kafka utilize partitioning (for concurrency) and replication (for reliability and availability) across clustered servers. But building scalable applications isn’t as easy as just throwing more servers at the clusters, and unexpected speed humps are common. Consequently, you also need to understand the performance impact of partitions, replication, and clusters; monitor the correct metrics to have an end-to-end view of applications and clusters; conduct careful benchmarking, and scale and tune iteratively to take into account performance insights and optimizations. In this presentation, I will explore some of the performance goals, challenges, solutions, and results I discovered over the last 5 years building multiple realistic demonstration applications. The examples will include trade-offs with elastic Cassandra auto-scaling, scaling a Cassandra and Kafka anomaly detection application to 19 Billion checks per day, and building low-latency streaming data pipelines using Kafka Connect for multiple heterogeneous source and sink systems.
Invited keynote for 5th Workshop on Hot Topics in Cloud Computing Performance (HotCloudPerf 2022) https://hotcloudperf.spec.org/ at ICPE 2022 https://icpe2022.spec.org/
OPEN Talk: Scaling Open Source Big Data Cloud Applications is Easy/HardPaul Brebner
DeveloperWeek Management 2022 Conference Presentation https://www.developerweek.com/global/conference/management/schedule/
In this Cartoon Style Visual Introduction to Apache Kafka we're going to build a "Postal Service" to deliver party invitations to two groups, Nerds and Pugsters – find out who goes to the party. Along the way we'll learn about Kafka Producers, Consumers, Groups, Topics, Partitions, Keys, Records, and Delivery Semantics (guaranteed delivery, and who gets what messages). We'll also have a quick look at Streams (mail sorting) and Connectors (how mail gets delivered between post offices).
Presentation for Open Source 101 2022: https://opensource101.com/sessions/a-visual-introduction-to-apache-kafka/
Video: https://youtu.be/NUnsHFn52sE
Building a real-time data processing pipeline using Apache Kafka, Kafka Conne...Paul Brebner
With the rapid onset of the global Covid-19 pandemic at the start of this year, the USA Centers for Disease Control and Prevention (CDC) had to quickly implement a new Covid-19-specific pipeline to collect testing data from all of the USA's states and territories, and carry out other critical steps including integration, cleaning, checking, enrichment, analysis, and enforcing data governance and privacy. The pipeline then produces multiple consumable results for federal and public agencies. They did this in under 30 days, using Apache Kafka. In this presentation we'll build a similar (but simpler) pipeline for ingesting, integrating, indexing, searching/analysing and visualising some publicly available tidal data. We'll briefly introduce each technology and component, and walk through the steps of using Apache Kafka, Kafka Connect, Elasticsearch and Kibana to build the pipeline and visualise the results.
Grid Middleware – Principles, Practice and PotentialPaul Brebner
A presentation I gave at UCL in 2004 while managing the UK OGSA Evaluation Project, on leave from CSIRO at the UCL Computer Science department, working with Wolfgang Emmerich.
Paul Brebner, University College London, Computer Science Department Seminar: "Grid Middleware - Principles, Practice, and Potential", 1 November 2004.
The project page was still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Grid middleware is easy to install, configure, secure, debug and manage acros...Paul Brebner
A presentation made in 2004 while I was managing the UK OGSA Evaluation Project, on leave from CSIRO at the UCL Computer Science department, working with Wolfgang Emmerich: in which we "believe 6 impossible things before breakfast". This project encountered, and partially solved, many of the problems that Cloud computing finally solved.
Paul Brebner, Oxford University Computing Laboratory invited talk: "Grid middleware is easy to install, configure, debug and manage - across multiple sites (One can't believe impossible things)", 15 October 2004.
The project web site is still here (2020): http://sse.cs.ucl.ac.uk/UK-OGSA/
Melbourne Big Data Meetup Talk: Scaling a Real-Time Anomaly Detection Applica...Paul Brebner
Apache Kafka, Apache Cassandra and Kubernetes are open source big data technologies enabling applications and business operations to scale massively and rapidly. While Kafka and Cassandra underpin the data layer of the stack, providing the capability to stream, disseminate, store and retrieve data at very low latency, Kubernetes is a container orchestration technology that helps automate application deployment and the scaling of application clusters.
In this presentation, Paul will reveal how he architected a massive-scale deployment of a streaming data pipeline with Kafka and Cassandra to cater to an example Anomaly detection application running on a Kubernetes cluster and generating and processing massive amounts of events. Anomaly detection is a method used to detect unusual events in an event stream.
It is widely used in a range of applications such as financial fraud detection, security, threat detection, website user analytics, sensors, IoT, system health monitoring, etc. When such applications operate at massive scale generating millions or billions of events, they impose significant computational, performance and scalability challenges to anomaly detection algorithms and data layer technologies. Paul will demonstrate the scalability, performance and cost effectiveness of Apache Kafka, Cassandra and Kubernetes, with results from his experiments allowing the Anomaly detection application to scale to 19 Billion anomaly checks per day.
Melbourne Big Data Meetup, March 5 2020
https://www.eventbrite.com/e/melbourne-big-data-meetup-realtime-anomaly-detection-with-cassandra-kafka-tickets-93028445585
0b101000 years of computing: a personal timeline - decade "0", the 1980'sPaul Brebner
With the arrival of the 2020's I realised I've now been involved in Computing for 4 decades. So I probably know more about the past of Computing than I will about the future! Here's a personal timeline of hopefully interesting things from the 1980's in Computing (at Waikato University, NZ, and UNSW in Australia).
ApacheCon Berlin 2019: Kongo:Building a Scalable Streaming IoT Application us...Paul Brebner
Join with me in a journey of exploration upriver with "Kongo", a scalable streaming IoT logistics demonstration application using Apache Kafka, the popular open source distributed streaming platform. Along the way you'll discover: an example logistics IoT problem domain (involving the rapid movement of thousands of goods by trucks between warehouses, with real-time checking of complex business and safety rules from sensor data); an overview of the Apache Kafka architecture and components; lessons learned from making critical Kafka application design decisions; an example of Kafka Streams for checking truck load limits; and finish the journey by overcoming final performance challenges and shooting the rapids to scale Kongo on a production Kafka cluster.
https://aceu19.apachecon.com/session/kongo-building-scalable-streaming-iot-application-using-apache-kafka
ApacheCon2019 Talk: Improving the Observability of Cassandra, Kafka and Kuber...Paul Brebner
As distributed applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical. Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works. In this presentation we’ll explore two complementary Open Source technologies: Prometheus for monitoring application metrics; and OpenTracing and Jaeger for distributed tracing. We’ll discover how they improve the observability of a massively scalable Anomaly Detection system - an application which is built around Apache Cassandra and Apache Kafka for the data layers, and dynamically deployed and scaled on Kubernetes, a container orchestration technology. We will give an overview of Prometheus and OpenTracing/Jaeger, explain how the application is instrumented, and describe how Prometheus and OpenTracing are deployed and configured in a production environment running Kubernetes, to dynamically monitor the application at scale. We conclude by exploring the benefits of monitoring and tracing technologies for understanding, debugging and tuning complex dynamic distributed systems built on Kafka, Cassandra and Kubernetes, and introduce a new use case to enable Cassandra Elastic Autoscaling, by combining Prometheus alerts, Instaclustr’s Provisioning API for Dynamic Resizing, and the new Prometheus monitoring API.
How to Improve the Observability of Apache Cassandra and Kafka applications...Paul Brebner
As distributed cloud applications grow more complex, dynamic, and massively scalable, "observability" becomes more critical. Observability is the practice of using metrics, monitoring and distributed tracing to understand how a system works. We'll explore two complementary Open Source technologies: Prometheus for monitoring application metrics, and OpenTracing and Jaeger for distributed tracing. We'll discover how they improve the observability of an Anomaly Detection application, deployed on AWS Kubernetes, and using Instaclustr managed Apache Cassandra and Kafka clusters.
Introducing Apache Kafka - a visual overview. Presented at the Canberra Big Data Meetup 7 February 2019. We build a Kafka "postal service" to explain the main Kafka concepts, and explain how consumers receive different messages depending on whether there's a key or not.
High School level (years 9-10 in Australia, ages 14-16) introduction to programming course, based on the language Processing, includes class material, exercises, examples, and tests. Course ran for 2 terms in 2014. Feel free to use as is, borrow ideas, etc. 13th and final class, 3d graphics.
46. Issues?
■ Cardinality for the partition key
● should be > 100,000
● i.e. use a geohash of >= 4 characters
■ Unbounded partitions are bad
● may need a composite partition key in production (see the sketch after this list)
● e.g. an extra time bucket (hour, day, etc.)
■ Space vs. time
● could have different sized time buckets for different sized spaces
● e.g. bigger areas with more frequent events may need shorter time buckets to limit partition size
● this may depend on the space-time scales of the underlying systems/processes
● e.g. the spatial and temporal scales of oceanographic processes (figure on the original slide)
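To make the composite partition key idea concrete, here is a minimal CQL sketch, assuming a hypothetical table (the names events_by_geohash_day, geohash4 and day are illustrative, not from the talk), that buckets each geohash partition by day so partitions stay bounded:
-- Hypothetical table: the partition key is a (geohash, day) pair, so a busy
-- geohash cell is split into one bounded partition per day.
CREATE TABLE events_by_geohash_day (
    geohash4 text,       -- 4+ character geohash (cardinality > 100,000)
    day date,            -- time bucket limiting partition size
    time timestamp,
    value double,
    PRIMARY KEY ((geohash4, day), time)
) WITH CLUSTERING ORDER BY (time DESC);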
49. Cassandra Table and Lucene Indexes
• Geopoint example
• Under the hood, indexing is done using a tree structure with geohashes (configurable precision)
CREATE TABLE latlong_lucene (
    geohash1 text,
    value double,
    time timestamp,
    latitude double,
    longitude double,
    PRIMARY KEY (geohash1, time)
) WITH CLUSTERING ORDER BY (time DESC);
CREATE CUSTOM INDEX latlong_index ON latlong_lucene ()
USING 'com.stratio.cassandra.lucene.Index'
WITH OPTIONS = {
    'refresh_seconds': '1',
    'schema': '{
        fields: {
            geohash1: {type: "string"},
            value: {type: "double"},
            time: {type: "date", pattern: "yyyy/MM/dd HH:mm:ss.SSS"},
            place: {type: "geo_point", latitude: "latitude", longitude: "longitude"}
        }
    }'
};
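As a usage sketch (not from the slides; the syntax follows the Stratio plugin's documented search JSON, and the coordinates and distance are illustrative), a proximity query against this table could filter rows by distance from a point using a geo_distance search:
-- Find all rows within 5km of a point; expr() hands the JSON search to the
-- Lucene index, which filters by distance rather than by exact geohash match.
SELECT * FROM latlong_lucene
WHERE expr(latlong_index, '{
    filter: {
        type: "geo_distance",
        field: "place",
        latitude: -35.28,
        longitude: 149.13,
        max_distance: "5km"
    }
}');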
60. Proximity rules
> 50m from people and property
> 150m from congested areas
> 1000m from airports
> 5000m from exclusion zones
These just happen to correspond to different length 3D geohashes.
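As a rough illustration of that correspondence (using standard 2D geohash cell widths; the talk's 3D geohashes add an altitude dimension): a length-8 cell is roughly 38m across, length 7 roughly 153m, length 6 roughly 1.2km, and length 5 roughly 4.9km, which line up approximately with the 50m, 150m, 1000m and 5000m rules. A minimal sketch, assuming a hypothetical denormalized table of hazards keyed by geohash cell, of how a rule then becomes a single-partition lookup:
-- Hypothetical table: each hazard is written at several geohash precisions,
-- one row per cell, so each proximity rule is a single-partition read.
CREATE TABLE hazards_by_geohash (
    geohash text,         -- cell at the precision matching the rule
    hazard_type text,     -- e.g. 'airport', 'exclusion_zone'
    hazard_id text,
    PRIMARY KEY (geohash, hazard_type, hazard_id)
);
-- "> 1000m from airports": truncate the drone's geohash to 6 characters and
-- check whether any airport shares that cell (the geohash value is illustrative).
SELECT hazard_id FROM hazards_by_geohash
WHERE geohash = 'r3gx2f' AND hazard_type = 'airport';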