Summary of "YCSB" paper for "nosql summer reading in Tokyo" on Sep 15, 2010 CLOUDIAN KK
This is summary material on the "Benchmarking Cloud Serving Systems with YCSB" paper, prepared for the nosql summer reading in Tokyo held on September 15, 2010 at Gemini Mobile Technologies in Shibuya, Tokyo.
HBaseCon 2015: Taming GC Pauses for Large Java Heap in HBase HBaseCon
In this presentation, we will introduce HotSpot's Garbage First collector (G1GC) as the most suitable collector for latency-sensitive applications running in large-memory environments. We will first discuss G1GC internal operations and tuning opportunities, and also cover tuning flags that set desired GC pause targets, change adaptive GC thresholds, and adjust GC activities at runtime. We will provide several HBase case studies using Java heaps as large as 100GB that show how to best tune applications to remove unpredictable, protracted GC pauses.
In September 2016, the PostgreSQL community is rolling out PostgreSQL 9.6, which includes improvements in parallelism for query performance, overall performance improvements, and the integration of foreign data sources.
This presentation introduces the new features of 9.6 and how they will benefit you.
- Parallel sequential scans, joins and aggregates
- Elimination of repetitive scanning of old data by autovacuum
- Synchronous replication now allows multiple standby servers for increased reliability
- Full-text search for phrases
- Support for remote joins, sorts, and updates in postgres_fdw
- Substantial performance improvements, especially in the area of improving scalability on many-CPU servers
If you have any questions on how to get started with Postgres, please email sales@enterprisedb.com
Now that you've seen HBase 1.0, what's ahead in HBase 2.0 and beyond, and why? Find out from this panel of people who have designed and/or are working on 2.0 features.
HBaseCon 2012 | HBase Coprocessors – Deploy Shared Functionality Directly on ... Cloudera, Inc.
The newly added Coprocessors feature in HBase allows the application designer to move functionality closer to where the data resides. While this sounds like Stored Procedures as known in the RDBMS realm, coprocessors have a different set of properties. The distributed nature of HBase adds to the complexity of their implementation, but the client-side API allows easy, transparent access to their functionality across many servers. This session explains the concepts behind coprocessors and uses examples to show how they can be used to implement data-side extensions to the application code.
We run multiple DataStax Enterprise clusters in Azure, each holding 300 TB+ of data, to deeply understand Office 365 users. In this talk, we will deep dive into some of the key challenges and takeaways from running these clusters reliably over a year. To name a few: process crashes, ephemeral SSDs contributing to data loss, slow streaming between nodes, mutation drops, compaction strategy choices, schema updates when nodes are down, and backup/restore. We will briefly talk about our contributions back to Cassandra, and our path forward using network-attached disks offered via Azure premium storage.
About the Speaker
Anubhav Kale Sr. Software Engineer, Microsoft
Anubhav is a senior software engineer at Microsoft. His team is responsible for building a big data platform using Cassandra, Spark and Azure to generate per-user insights of Office 365 users.
Accumulo Summit 2014: Benchmarking Accumulo: How Fast Is Fast? Accumulo Summit
Speaker: Mike Drob
Apache Accumulo has long held a reputation for enabling high-throughput operations in write-heavy workloads. In this talk, we use the Yahoo! Cloud Serving Benchmark (YCSB) to put real numbers on Accumulo performance. We then compare these numbers to previous versions, to other databases, and wrap up with a discussion of parameters that can be tweaked to improve them.
Details behind the Apache Cassandra 2.0 release and what is new in it, including lightweight transactions (compare and swap), eager retries, improved compaction, triggers (experimental), CQL cursors, and more!
The Google File System (GFS) presented in 2003 is the inspiration for the Hadoop Distributed File System (HDFS). Let's take a deep dive into GFS to better understand Hadoop.
Cassandra is a highly scalable, eventually consistent, distributed, structured columnfamily store with no single points of failure, initially open-sourced by Facebook and now part of the Apache Incubator. These slides are from Jonathan Ellis's OSCON 09 talk: http://en.oreilly.com/oscon2009/public/schedule/detail/7975
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera Cloudera, Inc.
If you’re running an HBase cluster in production, you’ve probably noticed that HBase shares a number of useful metrics about everything from your block cache performance to your HDFS latencies over JMX (or Ganglia, or just a file). The problem is that it’s sometimes hard to know what these metrics mean to you and your users. Should you be worried if your memstoreSizeMB is 1.5GB? What if your regionservers have a hundred stores each? This talk will explain how to understand and interpret the metrics HBase exports. Along the way we’ll cover some high-level background on HBase’s internals, and share some battle-tested rules of thumb about how to interpret and react to metrics you might see.
Presentation by Ben Bromhead, Co-Founder and CTO of Instaclustr, at the Cassandra Summit 2016 in San Jose.
This presentation will show how to create truly elastic Cassandra deployments on AWS, allowing you to scale and shrink your large Cassandra deployments multiple times a day. Leveraging a combination of EBS-backed disks, JBOD, token pinning and our previous work on bootstrapping from backups, you will be able to dramatically reduce costs per cluster by scaling to match your daily workloads.
This presentation will recount the story of Macys.com (and Bloomingdales.com)'s selection and migration from legacy RDBMS to NoSQL Cassandra in partnership with DataStax.
We'll start with a mercifully brief backgrounder on our website and our business. Then we will go over the various technologies that we considered, as well as our use case-based performance benchmarks that led to the decision to go with Cassandra.
We'll cover the various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks.
One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.
And, finally, we will wrap up with our "lessons learned" and a brief look at our future plans.
Cost and performance comparison for OpenStack compute and storage infrastructure Principled Technologies
To be competitive in the cloud, businesses need more efficient hardware and software for their OpenStack environment and solutions that can pool physical resources efficiently for consumption and management through OpenStack. Software-defined storage has made it easier to create storage resource pools that spread across your datacenter, but external or separate distributed storage systems are still a challenge for many. VMware vSphere with Virtual SAN converges storage resources with the hypervisor, which can allow for management of an entire infrastructure through a single vCenter Server, increase performance, save space, and reduce costs.
Both solutions used the same twelve drive bays per storage node for virtual disk storage; however, the tiered design of VMware Virtual SAN allowed for greater performance in our tests. We used a mix of two SSDs and 10 rotational drives for the VMware Virtual SAN solution, while we used 12 rotational drives for the Red Hat Storage Server solution behind a battery-backed RAID controller, the Red Hat recommended approach.
In our testing, the VMware vSphere with Virtual SAN solution performed better than the Red Hat Storage solution in both real-world and raw performance testing by providing 53 percent more database OPS and 159 percent more IOPS. In addition, the vSphere with Virtual SAN solution can occupy less datacenter space, which can result in lower costs associated with density. A three-year cost projection for the two solutions showed that VMware vSphere with Virtual SAN could save your business up to 26 percent in hardware and software costs when compared to the Red Hat Storage solution we tested.
Apache HBase, Accelerated: In-Memory Flush and Compaction HBaseCon
Eshcar Hillel and Anastasia Braginsky (Yahoo!)
Real-time HBase application performance depends critically on the amount of I/O in the datapath. Here we’ll describe an optimization of HBase for high-churn applications that frequently insert/update/delete the same keys, such as for high-speed queuing and e-commerce.
ClustrixDB 7.5 is the latest release of the only drop-in replacement for MySQL with true scale-out performance. The latest release of ClustrixDB is easier to use, provides more insight into the performance of the database and better utilizes hardware.
Performance of persistent apps on Container-Native Storage for Red Hat OpenSh... Principled Technologies
For companies in need of a comprehensive strategy for containers and software-defined storage, Red Hat Container Ready Storage paired with Red Hat OpenShift Container Platform offers a solution that allows them to leverage their investment in VMware vSphere. In our proof-of-concept study, we explored the scaling capabilities of a CNS implementation using two types of Western Digital storage media, Ultrastar He10 hard drives and the new Ultrastar SS200 solid-state drives. We tested the solutions under a variety of conditions, using both IO-intensive and CPU-intensive workloads, multiple vCPU allocation counts, and a range of quantities of app instances. In this document, we have presented some of the many resulting data points, including price/performance metrics, which have the potential to assist IT professionals implementing CNS to meet the unique needs of their businesses.
One Billion Black Friday Shoppers on a Distributed Data Store (Fahd Siddiqui,... DataStax
EmoDB is an open source RESTful data store built on top of Cassandra that stores JSON documents and, most notably, offers a databus that allows subscribers to watch for changes to those documents in real time. It features massive non-blocking global writes, asynchronous cross data center communication, and schema-less JSON content.
For non-blocking global writes, we created a "JSON delta" specification that defines incremental updates to any JSON document. Each row in Cassandra is thus a sequence of deltas that serves as a Conflict-free Replicated Datatype (CRDT) for EmoDB's system of record. We introduce the concept of "distributed compactions" to frequently compact these deltas for efficient reads.
Finally, the databus forms a crucial piece of our data infrastructure and offers a change queue to real time streaming applications.
About the Speaker
Fahd Siddiqui Lead Software Engineer, Bazaarvoice
Fahd Siddiqui is a Lead Software Engineer at Bazaarvoice in the data infrastructure team. His interests include highly scalable and distributed data systems. He holds a Master's degree in Computer Engineering from the University of Texas at Austin, and frequently speaks at the Austin C* User Group. About Bazaarvoice: Bazaarvoice is a network that connects brands and retailers to the authentic voices of people where they shop. More at www.bazaarvoice.com
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon Uni... MLconf
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
Accumulo Summit 2015: Ferrari on a Bumpy Road: Shock Absorbers to Smooth Out ... Accumulo Summit
Talk Abstract
Accumulo has a solid theoretical foundation, endowing it with huge scalability, high reliability, and the makings of class-leading performance for NoSQL operations. Several publications show Accumulo achieving multi-petabyte scalability and outperforming other databases in its class by orders of magnitude. However, there are challenges arising in practice that slow down that performance and introduce bottlenecks.
The root of Accumulo's distributed scale and performance while maintaining consistency comes from a multi-level amplification. Zookeeper bootstraps the consistency with a highly durable quorum. The Accumulo root table uses buffering and caching to boost that performance for sorted key/value operations. With the metadata tablets and data tables, Accumulo continues to boost performance and divides and conquers a highly scalable key/value space to leverage the resources of a large cluster. The challenge arises when metadata operations at the core of Accumulo bottleneck performance for the entire cluster.
In this talk we will describe the Accumulo metadata operations model in detail. With a couple of prototypical application scenarios, we will show a few areas that are current bottlenecks or that we can expect to be bottlenecks in the near future. We will also propose modifications to the current model and outline projects that the community can take on to keep Accumulo in the lead for performance and scalability.
Speaker
Adam Fuchs
Chief Technology Officer, Sqrrl
As the Chief Technology Officer and co-founder of Sqrrl, Adam Fuchs is responsible for ensuring that Sqrrl is leading the world in Big Data Infrastructure technology. Previously at the National Security Agency, Adam was an innovator and technical director for several database projects, handling some of the world’s largest and most diverse data sets. He is a co-founder of the Apache Accumulo project. Adam has a BS in Computer Science from the University of Washington and has completed extensive graduate-level course work at the University of Maryland.
(DAT402) Amazon RDS PostgreSQL: Lessons Learned & New Features Amazon Web Services
Learn the specifics of Amazon RDS for PostgreSQL’s capabilities and extensions that make it powerful. This session begins with a brief overview of the RDS PostgreSQL service and how it provides High Availability & Durability, and then deep dives into the new features that we have released since re:Invent 2014, including major version upgrade and newly added PostgreSQL extensions to RDS PostgreSQL. During the session, we will also discuss lessons learned running a large fleet of PostgreSQL instances, including specific recommendations. In addition, we will present benchmarking results looking at differences between the 9.3, 9.4 and 9.5 releases.
C* Summit 2013: CMB: An Open Message Bus for the Cloud by Boris Wolf DataStax Academy
The Comcast Silicon Valley Innovation Center has developed a general purpose message bus for the cloud. The service is API compatible with Amazon's SQS/SNS and is built on Cassandra and Redis with the goal of linear horizontal scalability. This presentation offers an in-depth look at the architecture of the system and how they employ Cassandra as a central component to meet key requirements. Latest feature enhancements and performance data will also be covered.
HBaseCon 2012 | HBase, the Use Case in eBay Cassini Cloudera, Inc.
eBay marketplace has been working hard on the next generation search infrastructure and software system, code-named Cassini. The new search engine processes over 250 million search queries and serves more than 2 billion page views each day. Its indexing platform is based on Apache Hadoop and Apache HBase. Apache HBase is a distributed persistent layer built on Hadoop to support billions of updates per day. Its easy sharding, fast writes and table scans, super fast bulk data load, and natural integration with Hadoop provide the cornerstones for successful continuous index builds. We will share with the audience the technical details, as well as the difficulties and challenges that we’ve gone through and that we are still facing in the process.
GECon2017_High-volume data streaming in azure_ Aliaksandr Laisha GECon_Org Team
The session will focus on solutions that require high-throughput ingestion and streaming of data in real time. You'll become familiar with different business use cases and architecture examples to get a general idea of stream processing systems and understand their concepts. Next, you'll get deep insights into the functional and non-functional capabilities of the Azure Event Hub service to see how it fits into the whole picture. Moreover, we’ll look at how to leverage Azure CosmosDB for high-throughput streaming when Event Hub is not suitable for various reasons.
Data on the Move: Transitioning from a Legacy Architecture to a Big Data Plat... MapR Technologies
Atzmon Hen-Tov & Lior Schachter, Pontis
Businesses everywhere are increasingly challenged by their dependencies on legacy platforms. The dramatic increase in data volume, speed, and types of data is quickly outstripping the capabilities of these legacy systems. By transitioning from a legacy RDBMS to a Hadoop-based platform, Pontis was able to process and analyze billions of mobile subscriber events every day. In this talk, we’ll provide a quick overview of our legacy system, as well as our process for migrating to our target architecture. We’ll continue with a review of our Hadoop platform selection process, which involved a thorough RFP and a detailed analysis of the top Hadoop platform vendors. This session will focus on how we gradually transitioned to our big data platform over the course of several product versions, resulting in higher scalability and a lower TCO in each version. We’ll outline the benefits of the target architecture, and detail how we successfully integrated Hadoop into our organization. Our session will conclude with a look at technical solutions for dealing with big data deficiencies.
Leveraging Threat Intelligence to Guide Your Hunts Sqrrl
This webinar training session covers everything from what threat intelligence is to specific examples of how to hunt with it, including applying intel during a tactical hunt and what you should look out for when searching for adversaries on your enterprise network. It is taught by Keith Gilbert, an experienced threat researcher with a background in Digital Forensics and Incident Response.
How to Hunt for Lateral Movement on Your Network Sqrrl
Once inside your network, most cyber-attacks go sideways. They progressively move deeper into the network, laterally compromising other systems as they search for key assets and data. Would you spot this lateral movement on your enterprise network?
In this training session, we review the various techniques attackers use to spread through a network, which data sets you can use to reliably find them, and how data science techniques can be used to help automate the detection of lateral movement.
Machine Learning for Incident Detection: Getting Started Sqrrl
This presentation walks you through the uses of machine learning in incident detection and response, outlining some of the basic features of machine learning and specific tools you can use.
Watch the presentation with audio here: https://www.youtube.com/watch?v=4pArapSIu_w
Building a Next-Generation Security Operations Center (SOC) Sqrrl
So, you need to build a Security Operations Center (SOC)? What does that mean? What does the modern SOC need to do? Learn from Dr. Terry Brugger, who has been doing information security work for over 15 years, including building out a SOC for a large Federal agency and consulting for numerous large enterprises on their security operations.
Watch the presentation with audio here: http://info.sqrrl.com/sqrrl-october-webinar-next-generation-soc
User and Entity Behavior Analytics using the Sqrrl Behavior Graph Sqrrl
UEBA leverages advanced statistical techniques and machine learning to surface subtle behaviors that are indicative of attacker presence. In this presentation, Sqrrl's Director of Data Science, Chris McCubbin, and Sqrrl's Director of Products, Joe Travaglini, provide an overview of how machine learning and UEBA can be used to detect cyber threats using Sqrrl's Behavior Graph.
Watch the presentation with audio here: http://info.sqrrl.com/april-2016-ueba-webinar-on-demand
Threat Hunting Platforms (Collaboration with SANS Institute) (Sqrrl)
Traditional security measures like firewalls, IDS, endpoint protection, and SIEMs are only part of the network security puzzle. Threat hunting is a proactive approach to uncovering threats that lie hidden in your network or system, that can evade more traditional security tools. Go in-depth with Sqrrl and SANS Institute to learn how hunting platforms work.
Watch the recording with audio here: http://info.sqrrl.com/sans-sqrrl-threat-hunting-webcast
Sqrrl and IBM: Threat Hunting for QRadar Users (Sqrrl)
This joint webinar, in collaboration with IBM, offers a look at the industry leading Threat Hunting App for IBM QRadar. By combining the threat detection capabilities of QRadar and Sqrrl, security analysts are armed with advanced analytics and visualization to hunt for unknown threats and more efficiently investigate known incidents.
Watch the training with audio here: http://info.sqrrl.com/sqrrl-ibm-threat-hunting-for-qradar-users
Threat Hunting for Command and Control Activity (Sqrrl)
Sqrrl's Security Technologist Josh Liburdi provides an overview of how to detect C2 through a combination of automated detection and hunting.
Watch the presentation with audio here: http://info.sqrrl.com/threat-hunting-for-command-and-control-activity
Today's threats demand a more active role in detecting and isolating sophisticated attacks. This must-see presentation provides practical guidance on modernizing your SOC and building out an effective threat hunting program. Ed Amoroso and David Bianco discuss best practices for developing and staffing a modern SOC, including the essential shifts in how to think about threat detection.
Watch the presentation with audio here: http://info.sqrrl.com/webinar-modernizing-your-security-operations
Threat Hunting vs. UEBA: Similarities, Differences, and How They Work Together (Sqrrl)
This presentation explains how security teams can leverage hunting and analytics to detect advanced threats faster, more reliably, and with common analyst skill sets. Watch the presentation with audio here: http://info.sqrrl.com/threat-hunting-and-ueba-webinar
In this training session, two leading security experts review how adversaries use DNS to achieve their mission, how to use DNS data as a starting point for launching an investigation, the data science behind automated detection of DNS-based malicious techniques and how DNS tunneling and DGA machine learning algorithms work.
Watch the presentation with audio here: http://info.sqrrl.com/leveraging-dns-for-proactive-investigations
If you follow the trade press, one theme you hear over and over again is that organizations are drowning in alerts. It’s true that we need technological solutions to prioritize and escalate the most important alerts to our analysts, but the humans have a critical part to play in this process as well. The quicker they are able to make decisions about the alerts they review, the better they are able to keep up. An incident responder’s most common task is alert triage, the process of investigation and escalation that ultimately results in the creation of security incidents. As crucial as this process is, remarkably little has been written about how to do it correctly and efficiently. In this presentation, learn incident response best practices from Sqrrl security expert David Bianco.
Slides from the webinar led by Ely Kahn and Luis Maldonado discussing strategies to reduce Mean Time to Know in detecting cybersecurity attacks, threats, or data breaches.
Sqrrl Enterprise: Big Data Security Analytics Use Case (Sqrrl)
Organizations are utilizing Sqrrl Enterprise to securely integrate vast amounts of multi-structured data (e.g., tens of petabytes) onto a single Big Data platform and then are building real-time applications using this data and Sqrrl Enterprise’s analytical interfaces. The secure integration is enabled by Accumulo’s innovative cell-level security capabilities and Sqrrl Enterprise’s security extensions, such as encryption.
Benchmarking The Apache Accumulo Distributed Key–Value Store (Sqrrl)
This paper presents the results of benchmarking the Apache Accumulo distributed table store using the continuous test suite included in its open source distribution.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GridMate - End to end testing is a critical piece to ensure quality and avoid... (ThomasParaiso2)
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which the participants explored different ways to think about quality and testing in different parts of the DevOps infinity loop.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features available on those devices, but many of those features offer convenience and capability at the expense of security. This best practices guide outlines steps users can take to better protect personal devices and information.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How it can help today’s businesses, and its benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024 (Neo4j)
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with DevOps.
Topics covered:
CI/CD within UiPath
End-to-end overview of CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
2. Scalable table stores are critical systems
Swapnil Patil, CMU 2
• For data processing & analysis (e.g. Pregel, Hive)
• For systems services (e.g., Google Colossus metadata)
3. Evolution of scalable table stores
Simple, lightweight stores → complex, feature-rich stores
Supports a broader range of applications and services
Hard to debug and understand performance problems
Complex behavior and interaction of various components
[Figure: Growing set of HBase features, timeline from 2008 to 2011+: RangeRowFilters, batch updates, bulk load tools, RegEx filtering, scan optimizations, HBASE release, co-processors, access control]
4. YCSB++ FUNCTIONALITY
ZooKeeper-based distributed and coordinated testing
API and extensions for the new Apache ACCUMULO DB
Fine-grained, correlated monitoring using OTUS
FEATURES TESTED USING YCSB++
• Batch writing
• Table pre-splitting
• Bulk loading
• Weak consistency
• Server-side filtering
• Fine-grained security
Tool released at http://www.pdl.cmu.edu/ycsb++
Need richer tools for understanding
advanced features in table stores …
5. Outline
• Problem
• YCSB++ design
• Illustrative examples
• Ongoing work and summary
6. Yahoo Cloud Serving Benchmark [Cooper2010]
• For CRUD (create-read-update-delete) benchmarking
• Single-node system with an extensible API
[Figure: YCSB architecture. Components shown: command-line parameters, workload parameter file, workload executor, client threads, stats, DB clients, storage servers (HBASE, other DBs)]
7. YCSB++: New extensions
Added support for the new Apache ACCUMULO DB
− New parameters and workload executors
[Figure: the YCSB architecture from the previous slide, with the new EXTENSIONS and the ACCUMULO client added]
8. YCSB++: Distributed & parallel tests
Multi-client, multi-phase coordination using ZooKeeper
− Enables testing at large scales and testing asymmetric features
[Figure: the YCSB architecture with EXTENSIONS and ACCUMULO, plus MULTI-PHASE processing and COORDINATION across multiple YCSB clients]
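A minimal sketch of the multi-phase idea, assuming an in-process `threading.Barrier` as a stand-in for the ZooKeeper-based coordination (the slides do not show the actual YCSB++ mechanism). The names `run_clients` and the phase labels are hypothetical. Each simulated client finishes a phase, then waits until every other client has finished it too.

```python
import threading

def run_clients(num_clients, phases):
    """Run every phase on all clients, with a barrier between phases."""
    # Assumption: this local barrier stands in for a ZooKeeper-based barrier.
    barrier = threading.Barrier(num_clients)
    log = []                      # (phase, client_id) in completion order
    lock = threading.Lock()

    def client(client_id):
        for phase in phases:
            # A real client would run its workload for this phase here.
            with lock:
                log.append((phase, client_id))
            barrier.wait()        # no client starts the next phase early

    threads = [threading.Thread(target=client, args=(i,))
               for i in range(num_clients)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log

log = run_clients(3, ["pre-load", "read/update", "load"])
print([phase for phase, _ in log])
```

Because of the barrier, all entries for one phase appear in the log before any entry for the next phase, whatever the thread scheduling.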
9. YCSB++: Collective monitoring
OTUS monitor built on Ganglia [Ren2011]
− Collects information from YCSB, table stores, HDFS and OS
[Figure: the YCSB++ architecture with EXTENSIONS, ACCUMULO, MULTI-PHASE processing and COORDINATION across YCSB clients, now with OTUS MONITORING added]
10. Example of YCSB++ debugging
OTUS collects fine-grained information
− Both HDFS process and TabletServer process on same node
[Figure: Monitoring Resource Usage and TableStore Metrics. X-axis: time (minutes). Y-axes: CPU usage (%, 0 to 100) and average number of StoreFiles per tablet (0 to 40). Series: Accumulo avg. StoreFiles per tablet, HDFS DataNode CPU usage, Accumulo TabletServer CPU usage]
11. Outline
• Problem
• YCSB++ design
• Illustrative examples
− YCSB++ on HBASE and ACCUMULO (Bigtable-like stores)
• Ongoing work and summary
12. Recap of Bigtable-like table stores
[Figure: tablet servers running on HDFS nodes. Data insertion goes to the write-ahead log and the memtable (1); minor compaction writes sorted, indexed files (2); major compaction merges them into fewer sorted, indexed files (3)]
Write-path: in-memory buffering & async FS writes
1) Mutations logged in memory tables (unsorted order)
2) Minor compaction: Memtables -> sorted, indexed files in HDFS
3) Major compaction: LSM-tree based file merging in background
Read-path: lookup both memtables and on-disk files
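The write and read paths above can be sketched with a toy in-memory tablet. This is an illustration only, not HBase or Accumulo code: the class `ToyTablet`, its `memtable_limit` parameter and the use of Python lists as "files" are assumptions, and the write-ahead log is omitted.

```python
import bisect

class ToyTablet:
    def __init__(self, memtable_limit=3):
        self.memtable = {}        # in-memory buffer of recent mutations
        self.files = []           # newest first; each file is a sorted list of (key, value)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value                    # step 1: buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        # Step 2: flush the memtable as one sorted file.
        if self.memtable:
            self.files.insert(0, sorted(self.memtable.items()))
            self.memtable = {}

    def major_compaction(self):
        # Step 3: merge all files into one; the newest value for a key wins.
        merged = {}
        for f in reversed(self.files):                # oldest first
            merged.update(f)
        self.files = [sorted(merged.items())] if merged else []

    def get(self, key):
        # Read path: check the memtable, then files from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for f in self.files:
            i = bisect.bisect_left(f, (key,))
            if i < len(f) and f[i][0] == key:
                return f[i][1]
        return None

t = ToyTablet(memtable_limit=2)
t.put("a", 1)
t.put("b", 2)      # memtable is full, so a minor compaction runs
t.put("a", 3)      # newer value lives only in the memtable
print(t.get("a"), t.get("b"))
```

A read may have to consult the memtable and several files, which is why the number of store files per tablet matters for read latency until a major compaction reduces it.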
13. Apache ACCUMULO
Started at NSA; now an Apache Incubator project
− Designed for high-speed ingest and scan workloads
− http://incubator.apache.org/projects/accumulo.html
New features in ACCUMULO
− Iterator framework for user-specified programs placed in between different stages of the DB pipeline
E.g., support joins and stream processing using iterators
− Also supports fine-grained cell-level access control
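A minimal sketch of what cell-level access control means, assuming each cell carries a plain list of required labels and a reader sees a cell only if it holds all of them. Real Accumulo visibility expressions are richer (boolean combinations of labels); the function `visible_cells` and the sample data are hypothetical.

```python
def visible_cells(cells, authorizations):
    """Server-side filter: keep cells whose required labels the reader holds."""
    auths = set(authorizations)
    return [(row, col, value)
            for row, col, value, labels in cells
            if set(labels) <= auths]      # every required label must be held

cells = [
    ("r1", "name", "alice", []),                 # no labels: visible to everyone
    ("r1", "ssn", "123", ["pii"]),
    ("r1", "dx", "flu", ["pii", "medical"]),
]
print(visible_cells(cells, ["pii"]))
```

Filtering on the server means a reader without the right labels never receives the hidden cells at all.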
15. Client-side batch writing
Feature: clients batch inserts, delay writes to server
• Improves insert throughput and latency
• Newly inserted data may not be immediately visible to other clients
[Figure: two YCSB++ clients, each with a store client. Client #1 holds its writes in a batch while Client #2 issues Read {K}; also shown are the table store servers and the ZooKeeper cluster manager]
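A minimal sketch of client-side batching, assuming a byte-size threshold triggers the flush. `ToyServer` and `BatchWriter` are hypothetical names, not the HBase or Accumulo client API, and the byte count is approximate (key plus value length).

```python
class ToyServer:
    def __init__(self):
        self.rows = {}
        self.rpcs = 0             # number of round trips received

    def apply_batch(self, batch):
        self.rpcs += 1
        self.rows.update(batch)

class BatchWriter:
    def __init__(self, server, max_bytes):
        self.server = server
        self.max_bytes = max_bytes
        self.buffer = {}
        self.buffered_bytes = 0

    def put(self, key, value):
        # The write is only buffered; other clients cannot see it yet.
        self.buffer[key] = value
        self.buffered_bytes += len(key) + len(value)
        if self.buffered_bytes >= self.max_bytes:
            self.flush()

    def flush(self):
        if self.buffer:
            self.server.apply_batch(self.buffer)   # one round trip for many rows
            self.buffer = {}
            self.buffered_bytes = 0

server = ToyServer()
writer = BatchWriter(server, max_bytes=20)
writer.put("k1", "v" * 8)
print("k1" in server.rows)    # still buffered on the client
writer.put("k2", "v" * 8)     # threshold reached, batch is sent
print("k1" in server.rows, server.rpcs)
```

The sketch shows both sides of the trade-off on this slide: fewer round trips per insert, and a window in which a second client reading the server would miss `k1`.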
16. Batch writing improves throughput
6 clients creating 9 million 1-Kbyte records on 6 servers
− Small batches - high client CPU utilization, limits throughput
− Large batches - saturate servers, limited benefit from batching
[Figure: inserts per second (1000s, y-axis 0 to 60) versus batch size (10 KB, 100 KB, 1 MB, 10 MB) for HBase and Accumulo]
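As a rough back-of-the-envelope for this setup, the number of flushes each client issues can be estimated from the batch size. The sketch assumes the 9 million records are split evenly over the 6 clients, that 1 KB means 1,000 bytes, and that only the record payload counts toward the batch size; none of these details are stated on the slide.

```python
RECORDS = 9_000_000
CLIENTS = 6
RECORD_BYTES = 1_000      # assumption: 1 KB = 1,000 bytes, payload only

def flushes_per_client(batch_bytes):
    records_per_client = RECORDS // CLIENTS
    records_per_batch = max(1, batch_bytes // RECORD_BYTES)
    return -(-records_per_client // records_per_batch)    # ceiling division

for label, size in [("10 KB", 10_000), ("100 KB", 100_000),
                    ("1 MB", 1_000_000), ("10 MB", 10_000_000)]:
    print(label, flushes_per_client(size))
```

Under these assumptions a 10 KB batch means 150,000 flushes per client and a 10 MB batch means 150, which is consistent with small batches costing client CPU per flush.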
17. Batch writing causes weak consistency
Test setup: ZooKeeper-based client coordination
• Share producer-consumer queue between readers/writers
• R-W lag = delay before C2 can read C1’s most recent write
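[Figure: test flow between Client #1 and Client #2, each running YCSB++ with a store client, with the table store servers and ZooKeeper behind them]
1) Client #1 inserts {K:V} through its batch (10^6 records)
2) Client #1 enqueues K (sample 1% of records)
3) Client #2 polls and dequeues K
4) Client #2 reads {K}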
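The measurement can be sketched as a small simulation in logical ticks rather than wall-clock time. Everything here is an assumption for illustration: `simulate_rw_lag`, the one-insert-per-tick pacing, the count-based batch, and a `deque` standing in for the shared ZooKeeper queue.

```python
from collections import deque

def simulate_rw_lag(num_inserts, batch_size, sample_every):
    """Return {key: ticks between C1's insert and C2's first successful read}."""
    visible = set()       # keys the server has received
    buffer = []           # C1's client-side write buffer
    queue = deque()       # shared queue of (key, insert_tick)
    lags = {}
    pending = None        # the key C2 is currently trying to read
    tick = 0
    while tick < num_inserts or queue or pending is not None:
        # C1: insert one record per tick, enqueue a sample of the keys.
        if tick < num_inserts:
            key = f"user{tick}"
            buffer.append(key)
            if tick % sample_every == 0:
                queue.append((key, tick))
            if len(buffer) >= batch_size or tick == num_inserts - 1:
                visible.update(buffer)    # batch flushed to the server
                buffer.clear()
        # C2: dequeue a key, then retry the read once per tick until it succeeds.
        if pending is None and queue:
            pending = queue.popleft()
        if pending is not None and pending[0] in visible:
            lags[pending[0]] = tick - pending[1]
            pending = None
        tick += 1
    return lags

print(simulate_rw_lag(num_inserts=8, batch_size=4, sample_every=2))
print(simulate_rw_lag(num_inserts=8, batch_size=1, sample_every=2))
```

With a batch of one record every sampled key is readable in the tick it was written, while larger batches push the lag up, which is the effect the R-W lag metric is designed to expose.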
18. Batch writing causes weak consistency
Deferred write wins, but lag can be ~100 seconds
− (N%) = fraction of requests that needed multiple read()s
− Implementation of batching affects the median latency
[Figure: two panels plotting fraction of requests (0 to 1) against read-after-write time lag (ms, logarithmic axis from 1 to 100000), one curve per buffer size; the percentage after each buffer size is the fraction of requests that needed multiple read()s]
(a) HBase: Time lag for different buffer sizes
10 KB (<1%), 100 KB (7.4%), 1 MB (17%), 10 MB (23%)
(b) Accumulo: Time lag for different buffer sizes
10 KB (<1%), 100 KB (1.2%), 1 MB (14%), 10 MB (33%)
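Each curve in these panels is a cumulative distribution: for a lag value on the x-axis, the y-value is the fraction of sampled requests whose read-after-write lag was at most that value. A minimal sketch of one such point, using made-up lag samples and a hypothetical helper `fraction_within`:

```python
def fraction_within(lags_ms, threshold_ms):
    """Fraction of requests whose read-after-write lag is at most threshold_ms."""
    return sum(1 for lag in lags_ms if lag <= threshold_ms) / len(lags_ms)

sample_lags = [1, 5, 50, 200]      # illustrative values, not measured data
print(fraction_within(sample_lags, 50))
```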
20. Features for high-speed insertions
Most table stores have high-speed ingest features
− Periodically insert large amounts of data or migrate old data in bulk
− Classic relational DB techniques applied to new stores
Two features: bulk loading and table pre-splitting
• Less data migration during inserts
• Engages more tablet servers immediately
• Need careful tuning and configuration [Sasha2002]
21. 8-phase test setup: table bulk loading
Bulk loading involves two steps
− Hadoop-based data formatting
− Importing store files into table store
Pre-load phase (1 and 2)
− Bulk load 6M rows in an empty table
− Goal: parallelism by engaging all servers
Load phase (4 and 5)
− Load 48M new rows
− Goal: study rebalancing during ingest
R/U measurements (3, 6 and 8)
− Correlate latency with rebalancing work
Phases
1) Pre-Load (re-formatting)
2) Pre-Load (importing)
3) Read/Update workload
4) Load (re-formatting)
5) Load (importing)
6) Read/Update workload
7) Sleep
8) Read/Update workload
22. Read latency affected by rebalancing work
[Figure: Accumulo read latency (ms, logarithmic axis from 1 to 1000) versus measurement phase running time (0 to 300 seconds) for R/U 1 (Phase 3), R/U 2 (Phase 6) and R/U 3 (Phase 8), shown next to the 8-phase diagram from the previous slide]
• High latency after high insertion periods that cause servers to rebalance (compactions)
• Latency drops after store is in a steady state
25. Extending to table pre-splitting
[Figure: phase diagrams of the two tests, side by side]
Table pre-splitting test, phase labels: Load, Pre-load, Pre-split into N ranges, Read/Update workload, Sleep, Read/Update workload
Bulk loading test, phase labels: Pre-Load (re-formatting), Pre-Load (importing), Read/Update workload, Load (re-formatting), Load (importing), Read/Update workload, Sleep, Read/Update workload
Pre-split a key range into N partitions to avoid splitting during insertion
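A minimal sketch of pre-splitting, assuming integer row keys spread evenly over a known key space (real table stores split on byte-string row keys, and the best split points depend on the key distribution). The helpers `split_points` and `tablet_for` are hypothetical.

```python
import bisect

def split_points(key_space, num_ranges):
    """Return num_ranges - 1 boundaries that divide [0, key_space) evenly."""
    return [key_space * i // num_ranges for i in range(1, num_ranges)]

def tablet_for(key, points):
    """Index of the range holding key; a boundary key goes to the range on its right."""
    return bisect.bisect_right(points, key)

points = split_points(100, 4)
print(points)
print([tablet_for(k, points) for k in (0, 24, 25, 99)])
```

With the ranges created up front, inserts for different keys land on different ranges immediately, instead of all hitting one tablet that then has to split during the load.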
26. Outline
• Problem
• YCSB++ design
• Illustrative examples
• Ongoing work and summary
27. Things not covered in this talk
More features: function shipping to servers
− Data filtering at the servers
− Fine-grained, cell-level access control
More details in the ACM SOCC 2011 paper
Ongoing work
− Analyze more table stores: Cassandra, CouchDB, MongoDB
− Continue research through the new Intel Science and Technology Center for Cloud Computing at CMU (with GaTech)
28. Summary: YCSB++ tool
• Tool for performance debugging and benchmarking advanced features using new extensions to YCSB
• Two case-studies: Apache HBASE and ACCUMULO
• Tool available at http://www.pdl.cmu.edu/ycsb++
Features studied:
− Weak consistency semantics
− Fast insertions (pre-splits & bulk loads)
− Server-side filtering
− Fine-grained access control
YCSB++ extensions:
− Distributed clients using ZooKeeper
− Multi-phase testing (with Hadoop)
− New workload generators and database client API extensions