These are the slides of my half-day workshop at EFS'11 in Stuttgart, where I covered some theoretical aspects of NoSQL data stores relevant to dealing with large amounts of data.
Kernel Recipes 2019 - GNU poke, an extensible editor for structured binary data - Anne Nicolas
GNU poke is a new interactive editor for binary data. Not limited to editing basic entities such as bits and bytes, it provides a full-fledged procedural, interactive programming language designed to describe data structures and to operate on them. Once a user has defined a structure for binary data (usually matching some file format), she can search, inspect, create, shuffle and modify abstract entities such as ELF relocations, MP3 tags, DWARF expressions, partition table entries, and so on, with primitives resembling simple editing of bits and bytes. The program comes with a library of already written descriptions (or “pickles” in poke parlance) for many binary formats.
GNU poke is useful in many domains. It is very well suited to aid in the development of programs that operate on binary files, such as assemblers and linkers. This was in fact the primary inspiration that brought me to write it: easily injecting flaws into ELF files in order to reproduce toolchain bugs. Due to its flexibility, poke is also very useful for reverse engineering, where the real structure of the data being edited is discovered by experiment, interactively. It is also good for the fast development of prototypes for programs like linkers, compressors or filters, and it provides a convenient foundation to write other utilities such as diff and patch tools for binary files.
This talk (unlike Gaul) is divided into four parts. First I will introduce the program and show what it does: from simple bit/byte editing to user-defined structures. Then I will show some of the internals and how poke is implemented. The third block will cover using Poke to describe user data, which is to say the art of writing “pickles”. The presentation ends with the status of the project, a call for hackers, and a hint at future work.
Jose E. Marchesi
The reasons why 64-bit programs require more stack memory - PVS-Studio
In forums, people often say that 64-bit versions of programs consume more memory and stack space, usually arguing that data sizes have doubled. But this claim is unfounded: the size of most types in C/C++ (char, short, int, float) remains the same on 64-bit systems. The size of a pointer has certainly increased, but far from all the data in a program consists of pointers. The reasons why programs consume more memory are more complex, so I decided to investigate the issue in detail.
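To see the point about type sizes for yourself, a minimal C check prints them directly (the comments assume a typical LP64 Unix system; 64-bit Windows uses LLP64, where long stays at 4 bytes):

    #include <stdio.h>

    int main(void) {
        printf("char:  %zu\n", sizeof(char));   /* 1 on 32- and 64-bit alike */
        printf("short: %zu\n", sizeof(short));  /* 2 */
        printf("int:   %zu\n", sizeof(int));    /* 4 */
        printf("float: %zu\n", sizeof(float));  /* 4 */
        printf("long:  %zu\n", sizeof(long));   /* 4 on 32-bit, 8 on LP64 */
        printf("void*: %zu\n", sizeof(void *)); /* 4 -> 8: pointers double */
        return 0;
    }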
Kernel Recipes 2019 - Faster IO through io_uring - Anne Nicolas
io_uring provides a new asynchronous I/O interface in Linux that aims to address limitations of existing interfaces like aio and libaio. It uses ring-based submission and completion queues to support asynchronous I/O operations with low latency and high throughput. Though initially skeptical, Linus Torvalds ultimately merged io_uring into the Linux kernel, citing its improvements in features, ease of use, and efficiency over the alternatives.
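For a flavor of the interface, here is a minimal sketch using the liburing helper library: it submits a single asynchronous read and waits for its completion (error handling omitted for brevity; link with -luring):

    #include <fcntl.h>
    #include <stdio.h>
    #include <liburing.h>

    int main(void) {
        struct io_uring ring;
        char buf[4096];

        int fd = open("/etc/hostname", O_RDONLY);
        io_uring_queue_init(8, &ring, 0);            /* 8-entry rings */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
        io_uring_submit(&ring);                      /* one syscall to submit */

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);              /* block on completion */
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }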
1. The document discusses the Advanced Encryption Standard (AES) cipher, which was selected in 2000, based on the Rijndael algorithm, to replace the Data Encryption Standard (DES).
2. AES has a block size of 128 bits, with key sizes of 128, 192, or 256 bits. It operates on a 4x4 column-major matrix of bytes (the state) and consists of 10, 12, or 14 rounds depending on the key size.
3. Each round performs byte substitution, shifting rows of the state, mixing columns using matrix multiplication, and adding the round key using XOR. The key is expanded using XOR and S-boxes to generate round keys.
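As a small illustration of one of those per-round steps, here is a sketch of ShiftRows in C on the 4x4 column-major state; this is illustrative only, not a hardened AES implementation:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* AES state: 16 bytes in column-major order, state[col * 4 + row]. */
    static void shift_rows(uint8_t s[16]) {
        uint8_t t[16];
        for (int row = 0; row < 4; row++)
            for (int col = 0; col < 4; col++)
                /* Row r is rotated left by r byte positions. */
                t[col * 4 + row] = s[((col + row) % 4) * 4 + row];
        memcpy(s, t, 16);
    }

    int main(void) {
        uint8_t s[16];
        for (int i = 0; i < 16; i++) s[i] = (uint8_t)i;
        shift_rows(s);
        for (int i = 0; i < 16; i++) printf("%02x ", s[i]);
        printf("\n");
        return 0;
    }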
This document provides an introduction and overview of Cassandra including:
- Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008.
- Key features of Cassandra including linear scalability, continuous availability, ability to span multiple data centers, and operational simplicity.
- A high-level overview of Cassandra's architecture including its use of Dynamo and BigTable papers for the cluster and data storage layers.
- Concepts related to Cassandra's data model including data distribution, token ranges, replication, write path, and "last write wins" consistency.
Cassandra is a structured storage system designed for large amounts of data across commodity servers. It provides high availability with eventual consistency and scales incrementally without centralized administration. Data is partitioned across nodes and replicated for fault tolerance. Writes are applied locally and propagated asynchronously, prioritizing availability over consistency. It uses a gossip protocol for membership and failure detection.
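To make token-based placement concrete, here is a toy C sketch of a hash ring: each node owns the range up to its token, and a key lands on the first node whose token is greater than or equal to the key's hash. The FNV-1a hash, the four evenly spaced tokens and the key names are all made up for illustration; Cassandra itself uses Murmur3 over a much larger token space:

    #include <stdint.h>
    #include <stdio.h>

    static uint32_t hash_key(const char *key) {    /* FNV-1a, a stand-in */
        uint32_t h = 2166136261u;
        while (*key) { h ^= (uint8_t)*key++; h *= 16777619u; }
        return h;
    }

    int main(void) {
        /* Four nodes with evenly spaced tokens on a 32-bit ring. */
        uint32_t tokens[4] = { 0x40000000, 0x80000000, 0xC0000000, 0xFFFFFFFF };
        const char *keys[] = { "alice", "bob", "carol" };

        for (int i = 0; i < 3; i++) {
            uint32_t t = hash_key(keys[i]);
            int owner = 0;
            while (owner < 3 && t > tokens[owner]) owner++; /* walk the ring */
            printf("key %-5s token %08x -> node %d\n", keys[i], t, owner);
        }
        return 0;
    }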
This document discusses Python and web frameworks. It begins with an introduction to Python and its advantages for web development. It then discusses several popular Python web frameworks including web.py, Flask, and Django. It also covers related topics like WSGI, templating with Jinja2, asynchronous programming, and deployment with virtualenv.
NoSQL addresses issues related to large volumes of data, including poorly structured data, simplicity of data management, frequent reads and writes, big data streams, huge data storage needs, fast data filtering, complex relationships, and real-time processing and analysis. It works by chopping data into smaller, manageable pieces, separating reads from writes, using techniques like caching, and designing for unlimited data growth. Key aspects include minimizing relations, parallelizing and distributing operations, and avoiding single points of failure.
The document discusses optimizing code and data for CPU caches through various techniques like improving data locality, reducing unnecessary memory accesses, and reusing cached data. It covers optimizing code layout, data structures, prefetching, and addressing issues like aliasing.
The document discusses various techniques for optimizing memory usage and cache performance in code. It begins by justifying the need for memory optimization given trends in CPU and memory speeds. It then provides an overview of memory hierarchies and caches. The rest of the document discusses specific techniques for optimizing data structures, prefetching, layout of data in memory, reducing aliasing, and other strategies to improve cache utilization and performance.
Elliptics is a distributed, fault-tolerant data storage system built on distributed hash table (DHT) principles. It is designed to store medium to large data records from 1KB to terabytes in size. Key features include no single point of failure, high availability even during network or hardware failures, fast read and write speeds through techniques like asynchronous I/O, caching and direct peer-to-peer data streaming for large files. Elliptics ensures data consistency through techniques like replication across data centers and automatic data repartitioning when nodes are added or removed. It provides a simple interface for use in C/C++, Go and via HTTP/REST.
Cassandra Community Webinar | Introduction to Apache Cassandra 1.2 - DataStax
Title: Introduction to Apache Cassandra 1.2
Details: Join Aaron Morton, DataStax MVP for Apache Cassandra, and learn the basics of the massively scalable NoSQL database. This webinar will examine C*’s architecture and its strengths for powering mission-critical applications. Aaron will introduce you to core concepts such as Cassandra’s data model, multi-datacenter replication, and tunable consistency. He’ll also cover new features in Cassandra version 1.2, including virtual nodes, the CQL 3 language, and query tracing.
Speaker: Aaron Morton, Apache Cassandra Committer
Aaron Morton is a Freelance Developer based in New Zealand, and a Committer on the Apache Cassandra project. In 2010, he gave up the RDBMS world for the scale and reliability of Cassandra. He now spends his time advancing the Cassandra project and helping others get the best out of it.
Cassandra Community Webinar - Introduction To Apache Cassandra 1.2 - aaronmorton
This document provides an introduction to Apache Cassandra, including an overview of key concepts like the cluster, nodes, data model, and data modeling best practices. It discusses Cassandra's origins and popularity. The presentation covers the cluster architecture with consistent hashing and token ranges, replication strategies, consistency levels, and more. It also summarizes the Cassandra data model including tables, columns, SSTables, caching, compaction and discusses building a Twitter-like data model in CQL.
Handling Data in Mega Scale Web Systems - Vineet Gupta
The document discusses several challenges faced by large-scale web companies in managing enormous and rapidly growing amounts of data. It provides examples of architectures developed by companies like Google, Amazon, Facebook and others to distribute data and queries across thousands of servers. Key approaches discussed include distributed databases, data partitioning, replication, and eventual consistency.
Cassandra introduction apache con 2014 budapest - Duyhai Doan
This document provides an introduction and summary of Cassandra presented by Duy Hai Doan. It discusses Cassandra's history as a NoSQL database created at Facebook and open sourced in 2008. The key architecture of Cassandra including its data distribution across nodes, replication for failure tolerance, and consistency models for reads and writes is summarized.
The document discusses moving away from traditional backend architectures towards more real-time and distributed systems. It advocates for storing data in a distributed database and processing it asynchronously in batches to improve user experience. Concrete examples are given of using protocols like ProtoBufs over REST, writing data from mobile clients to partitions, and performing analytics on batches of data later.
Language-agnostic data analysis workflows and reproducible research - Andrew Lowe
This was a talk that I gave at CERN at the Inter-experimental Machine Learning (IML) Working Group Meeting in April 2017 about language-agnostic (or polyglot) analysis workflows. I show how it is possible to work in multiple languages and switch between them without leaving the workflow you started. Additionally, I demonstrate how an entire workflow can be encapsulated in a markdown file that is rendered to a publishable paper with cross-references and a bibliography (with a raw LaTeX file produced as a by-product) in a simple process, making the whole analysis workflow reproducible. For experimental particle physics, ROOT is the ubiquitous data analysis tool, and has been for the last 20 years, so I also talk about how to exchange data to and from ROOT.
These days fast code needs to operate in harmony with its environment. At the deepest level this means working well with hardware: RAM, disks and SSDs. A unifying theme is treating memory access patterns in a uniform and predictable way that is sympathetic to the underlying hardware. For example writing to and reading from RAM and Hard Disks can be significantly sped up by operating sequentially on the device, rather than randomly accessing the data. In this talk we’ll cover why access patterns are important, what kind of speed gain you can get and how you can write simple high level code which works well with these kind of patterns.
Performance and Predictability - Richard Warburton - JAXLondon2014
This document discusses various low-level performance optimizations related to branch prediction, memory access, storage, and conclusions. It explains that branches can cause stalls, caches help mitigate slow memory access, and sequential access patterns outperform random access. The key themes are optimizing for predictability over randomness and prioritizing principles over specific tools.
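A rough way to observe the sequential-versus-random effect from plain C: touch the same bytes once sequentially and once with a large stride. The buffer size and the 4096-byte stride are arbitrary choices for illustration, and absolute timings will vary by machine; only the ratio matters:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    enum { N = 64 * 1024 * 1024 };   /* 64 MB buffer */

    int main(void) {
        char *a = calloc(N, 1);
        if (!a) return 1;
        long sum = 0;

        clock_t t0 = clock();
        for (long i = 0; i < N; i++) sum += a[i];             /* sequential */
        clock_t t1 = clock();
        printf("sequential: %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        t0 = clock();
        for (long s = 0; s < 4096; s++)                       /* same work, */
            for (long i = s; i < N; i += 4096) sum += a[i];   /* poor locality */
        t1 = clock();
        printf("strided:    %.3fs\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

        free(a);
        return (int)(sum & 1);       /* keep sum live so loops aren't elided */
    }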
Sql on hadoop the secret presentation.3pptx - Paulo Alonso
This document discusses using SQL on Hadoop to enable faster analytics. It notes that while Hadoop is good for batch processing large datasets, SQL on Hadoop can provide faster access to data for interactive queries. The document discusses using in-memory technologies to improve SQL query performance on Hadoop and enable lower latency queries. It also discusses building an analytical platform that can query data stored in Hadoop, data warehouses, and other sources to provide business users with faster, self-service access to data.
The document discusses data partitioning and distribution across multiple machines in a cluster. It explains that data replication does not scale well, but data partitioning, where each record exists on only one machine, allows write latency to scale with the number of machines in the cluster. Coherence provides a distributed cache that partitions data and offers functions for server-side processing near the data through tools like entry processors.
This document summarizes a presentation about using Redis for duplicate document detection in a real-time data stream. The key points covered include:
- Redis is used to map external document IDs to internal IDs and cache these mappings to detect duplicates efficiently
- Lua scripting is used to generate IDs and check for duplicates in an atomic way
- Redis data structures like hashes and counters help count documents and store metadata efficiently
- A production deployment involved a single Redis server handling 70M keys and 10GB of RAM, with replication for high availability
Redis - for duplicate detection on real time stream - Codemotion
Roberto "frank" Franchini presenta a Codemotion Techmeetup Torino Redis, un data structure server che può utilizzare come chiavi stringhe, hashes, lists, sets, sorted sets, bitmaps e hyperloglogs
.
The document provides an overview of the Arduino programming language and hardware. It describes the basic structure of an Arduino program with setup() and loop() functions. It lists the main data types and functions for digital and analog input/output, time, math, random numbers, serial communication and more. It also provides information on libraries, the Arduino board pins and components, and compares Arduino to the Processing language.
Spark streaming can be used for near-real-time data analysis of data streams. It processes data in micro-batches and provides windowing operations. Stateful operations like updateStateByKey allow tracking state across batches. Data can be obtained from sources like Kafka, Flume, HDFS and processed using transformations before being saved to destinations like Cassandra. Fault tolerance is provided by replicating batches, but some data may be lost depending on how receivers collect data.
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011) - Matthew Lease
Here are a few reasons why using the reducer as the combiner doesn't work for computing the mean:
1. The reducer expects an iterator of values for a given key. But for the mean, we need to track the sum and count across all values for a key.
2. The reducer is called once per unique key. But to compute the mean, we need to track partial sums and counts across multiple invocations for the same key. There is no way to preserve state between calls to the reducer.
3. The reducer output type needs to match the mapper output type. But for the mean, the mapper emits (key, value) pairs while the reducer would need to emit (key, sum, count) triples.
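The standard fix is a combiner that emits partial (sum, count) pairs, which combine associatively, rather than partial means. A small C sketch of why averaging the averages goes wrong (the values and the two-partition split are made up):

    #include <stdio.h>

    typedef struct { double sum; long count; } Partial;

    /* Associative and commutative, so it can be applied in any order,
       any number of times -- unlike taking a mean of means. */
    static Partial combine(Partial a, Partial b) {
        return (Partial){ a.sum + b.sum, a.count + b.count };
    }

    int main(void) {
        /* Two map-side partitions for the same key: {1, 2, 3} and {10}. */
        Partial p1 = { 1 + 2 + 3, 3 };
        Partial p2 = { 10, 1 };

        double mean_of_means =
            ((p1.sum / p1.count) + (p2.sum / p2.count)) / 2;
        Partial total = combine(p1, p2);

        printf("mean of partial means: %.2f (wrong)\n", mean_of_means);
        printf("true mean:             %.2f\n", total.sum / total.count);
        return 0;
    }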
@pavlobaron Why monitoring sucks and how to improve it - Pavlo Baron
The document discusses why monitoring systems currently have issues and how they could be improved. Some key problems identified are that monitoring tools are not consistent or adaptive to changing IT systems, they rely on outdated technologies, use simple mathematical models, lack timely and high-resolution data, have binary alerting, and provide limited opportunities for feedback. The document argues monitoring tools should be more intelligent, adaptive, and aim to reduce the need for expensive human experts.
Why we do tech the way we do tech now (@pavlobaron) - Pavlo Baron
The document discusses why technology is developed and implemented the way it currently is. It states that money prioritizes speed and cost over quality, and that patchwork solutions, copying existing code, and using free tools are cheaper than standards, rewrites, training, and planned development. It concludes that businesses adapted to prioritize technology to increase productivity and revenue, and that technology continues to disrupt and transform businesses and development practices.
More Related Content
Similar to Big Data & NoSQL - EFS'11 (Pavlo Baron) (20)
Current databases and database access methods are not well suited for reactive, streaming applications. A "living database" is proposed that treats all data as a continuous stream ordered by time, published on channels. Queries would be continuous and results materialized and published continuously as well. The database would be an active participant in the overall reactive data flow.
Becoming reactive without overreacting (@pavlobaron) - Pavlo Baron
This document discusses the reactive approach to building enterprise systems. It argues that becoming reactive means making reactiveness explicit rather than implicit. All systems are inherently reactive under the hood due to things like threading, event loops, and interrupts. The benefits of reactive systems include responsiveness to change and resource efficiency. Some challenges to going reactive include overcoming the temptation to go fully or partially reactive without understanding reactivity, and writing one's own reactive framework instead of using existing ones.
The hidden costs of the parallel world (@pavlobaron) - Pavlo Baron
The document discusses the challenges of parallel and distributed computing. It notes that while parallelism can provide performance benefits, it also introduces significant complexity. The document argues that parallelism should be implemented transparently at the runtime level, rather than requiring new programming models, in order to make the technology easier for developers to understand and use. It concludes that current parallel tools and approaches remain limited until they can be effectively adopted by all developers, not just experts.
The document discusses the results of a study on the impact of COVID-19 lockdowns on air pollution. Researchers found that lockdowns led to significant short-term reductions in nitrogen dioxide and fine particulate matter pollution globally as economic activities slowed. However, the impacts on greenhouse gases and long-term air quality improvements remain uncertain without permanent behavior and economic changes.
Data on its way to history, interrupted by analytics and silicon (@pavlobaron) - Pavlo Baron
This document discusses the challenges and strategies for building a system that continuously analyzes event streams in real-time to detect patterns and anomalies. Some key challenges include handling high event volumes, performing one-pass algorithms efficiently, and dealing with non-deterministic event ordering. The system is being built on the JVM to take advantage of its concurrency features but may require optimizations like off-heap storage to handle memory pressure from garbage collection. The goal is to process over a million events per second on a single machine through techniques like parallelization and sampling.
This document discusses functional reactive programming (FRP) as a way to handle ever-changing data. It explains that in FRP, values can change over time and logic does not explicitly account for time, but instead recomputes based on value changes. This allows modeling of signals, events, flows, transports, and logic separately. It provides examples using Erlang and Elm frameworks that partially or fully implement FRP concepts. Finally, it suggests FRP could be useful for problems involving interaction, streaming data, robotics, continuous analytics, and machine-to-machine communication where logic needs to reapply on constant value changes.
Near realtime analytics - technology choice (@pavlobaron) - Pavlo Baron
The document compares and contrasts the characteristics of Wile E. Coyote and Road Runner from the Looney Tunes cartoons. It describes Coyote as slow, offline, having a wide field of vision, long memory, being proactive, thorough, and always losing to Road Runner. Road Runner is described as fast, always running, having a narrow field of vision, short memory, being reactive, spontaneous, and always winning against Coyote. The document then discusses the differences between batch and near real-time analytics and technologies.
Set this Big Data technology zoo in order (@pavlobaron) - Pavlo Baron
The document discusses strategies for processing big data in real-time or near real-time. It defines different levels of processing speed from real-time to near real-time to fast to batch processing. It emphasizes that to gain useful information from live data as close to real-time as possible provides the greatest business advantage. It provides recommendations for optimizing various parts of the data pipeline for speed, such as using optimized data formats, parallel processing, in-memory storage, and avoiding unnecessary movement or abstraction of data.
a Tech guy’s take on Big Data business cases (@pavlobaron) - Pavlo Baron
The document discusses big data, describing what it is and is not. It argues that big data is about gaining useful information from various data sources to increase a company's value through faster and better decisions and predictions. It asserts that big data is a necessity and that any company can and should leverage it by knowing their customers, business, competitors and offers to make and save money.
Diving into Erlang is a one-way ticket (@pavlobaron) - Pavlo Baron
The author describes their journey through various programming languages from assembly to C/C++ to Java. They were asked to parallelize some C code using Erlang and spent 3 months researching it, finding that Erlang was a good solution. This sparked further interest in Erlang, leading the author to write a book on it and incorporate Erlang concepts into other languages. The author believes Erlang is a precise and useful tool for distributed systems and messaging problems. They encourage others to learn Erlang as well.
The document summarizes key concepts of Dynamo including:
- Dynamo focuses on immediate, reliable writes and operation relaxation rather than speed.
- It provides distribution, fault tolerance, and almost linear scalability.
- Vector clocks are used to track causality and determine operation order across nodes (see the sketch after this list).
- Merkle trees enable efficient delta tracking during data replication.
- Gossip protocols are used for node discovery and failure detection.
- Dynamo supports eventual consistency through techniques like hinted handoff and quorum replication.
- The CAP theorem establishes that a distributed system can only optimally support two of consistency, availability, and partition tolerance at once.
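As an aside on the vector-clock point above: the causality check reduces to a componentwise comparison of the two clocks. A minimal C sketch, where the node count and clock values are made up for illustration:

    #include <stdio.h>

    #define NODES 3

    typedef struct { int c[NODES]; } VClock;
    typedef enum { BEFORE, AFTER, CONCURRENT, EQUAL } Order;

    /* a happened-before b iff a <= b componentwise and a != b. */
    static Order compare(const VClock *a, const VClock *b) {
        int less = 0, greater = 0;
        for (int i = 0; i < NODES; i++) {
            if (a->c[i] < b->c[i]) less = 1;
            if (a->c[i] > b->c[i]) greater = 1;
        }
        if (less && greater) return CONCURRENT;  /* conflicting siblings */
        if (less)            return BEFORE;
        if (greater)         return AFTER;
        return EQUAL;
    }

    int main(void) {
        VClock a = {{2, 0, 0}};  /* node 0 wrote twice                  */
        VClock b = {{2, 1, 0}};  /* node 1 wrote after seeing a         */
        VClock c = {{1, 0, 1}};  /* node 2 wrote without seeing all of a */
        printf("a vs b: %s\n", compare(&a, &b) == BEFORE ? "a -> b" : "?");
        printf("a vs c: %s\n", compare(&a, &c) == CONCURRENT ? "concurrent" : "?");
        return 0;
    }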
Chef's Coffee - provisioning Java applications with Chef (@pavlobaron) - Pavlo Baron
Pavlo Baron discusses how Chef can be used to configure and manage infrastructure as code. Chef allows you to pick a hosting strategy, manage environments and configurations, and define roles and run lists to deploy applications. Knife and Ohai tools can be used to manage nodes and retrieve information. Infrastructure as code with Chef provides abstractions and allows reusing configurations across environments and applications.
What can be done with Java, but should better be done with Erlang (@pavlobaron) - Pavlo Baron
Erlang excels at building distributed, fault-tolerant, concurrent applications due to its lightweight process model and built-in support for distribution. However, Java is more full-featured and is generally a better choice for applications that require more traditional object-oriented capabilities or need to interface with existing Java libraries and frameworks. Both languages have their appropriate uses depending on the requirements of the specific application being developed.
20 reasons why we don't need architects (@pavlobaron) - Pavlo Baron
This document discusses the changing role of software architects and argues that they are no longer needed. It notes that agility has become mainstream and that conflicts between architects and developers are too large. It suggests that the team as a whole can serve as the architect and that what is needed are tools to help with architecture management, workflow, and reproducing architectures across teams. The document questions what an architect should be in this new landscape, listing roles like visionary, chief motivator, and worker.
Theoretical aspects of distributed systems - playfully illustrated (@pavlobaron) - Pavlo Baron
This document discusses key concepts in distributed systems including:
1. Computer clocks are inconsistent and time is relative in distributed systems, requiring logical clocks to track changes.
2. Consistent hashing allows data to be localized on changing infrastructure with minimal reorganization through ring hashing with fixed rules.
3. Nodes gossip to asynchronously share data and infrastructure changes in unpredictable order and repetition.
The document is a presentation about agile methodologies and excuses for not being truly agile. It contains stories and examples to illustrate key agile principles like simplicity, self-organization, motivated individuals, continuous improvement, changing requirements, face-to-face conversation, technical excellence, collaboration, valuable software, and working software. The overall message is that no methodology alone ensures agility and that true agility requires embracing certain mindsets and behaviors.
Harry Potter and Enormous Data (Pavlo Baron)
The document discusses challenges related to handling enormous amounts of data and provides recommendations for technologies and approaches to address those challenges. It notes that as data volumes increase, traditional databases and tools may no longer suffice. It recommends distributed, parallelized, and real-time approaches like Hadoop, stream processing engines, graph databases, GPUs, and cloud-based storage and CDNs to optimize for large volumes of data from various sources like sensors, logs, videos and more. Specific use cases and questions around customization, monitoring, risk analysis, visualization and content delivery are also addressed.
This presentation provides valuable insights into effective cost-saving techniques on AWS. Learn how to optimize your AWS resources by rightsizing, increasing elasticity, picking the right storage class, and choosing the best pricing model. Additionally, discover essential governance mechanisms to ensure continuous cost efficiency. Whether you are new to AWS or an experienced user, this presentation provides clear and practical tips to help you reduce your cloud costs and get the most out of your budget.
GraphRAG for Life Science to increase LLM accuracy - Tomaz Bratanic
GraphRAG for the life science domain, where you retrieve information from biomedical knowledge graphs using LLMs to increase the accuracy and performance of generated answers
Dandelion Hashtable: beyond billion requests per second on a commodity server - Antonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design (1) offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
Ivanti’s Patch Tuesday breakdown goes beyond patching your applications and brings you the intelligence and guidance needed to prioritize where to focus your attention first. Catch early analysis on our Ivanti blog, then join industry expert Chris Goettl for the Patch Tuesday Webinar Event. There we’ll do a deep dive into each of the bulletins and give guidance on the risks associated with the newly-identified vulnerabilities.
Generating privacy-protected synthetic data using Secludy and Milvus - Zilliz
During this demo, the founders of Secludy will demonstrate how their system utilizes Milvus to store and manipulate embeddings for generating privacy-protected synthetic data. Their approach not only maintains the confidentiality of the original data but also enhances the utility and scalability of LLMs under privacy constraints. Attendees, including machine learning engineers, data scientists, and data managers, will witness first-hand how Secludy's integration with Milvus empowers organizations to harness the power of LLMs securely and efficiently.
HCL Notes and Domino License Cost Reduction in the World of DLAU - panagenda
Webinar Recording: https://www.panagenda.com/webinars/hcl-notes-and-domino-license-cost-reduction-in-the-world-of-dlau/
The introduction of DLAU and the CCB & CCX licensing model caused quite a stir in the HCL community. As a Notes and Domino customer, you may have faced challenges with unexpected user counts and license costs. You probably have questions on how this new licensing approach works and how to benefit from it. Most importantly, you likely have budget constraints and want to save money where possible. Don’t worry, we can help with all of this!
We’ll show you how to fix common misconfigurations that cause higher-than-expected user counts, and how to identify accounts which you can deactivate to save money. There are also frequent patterns that can cause unnecessary cost, like using a person document instead of a mail-in for shared mailboxes. We’ll provide examples and solutions for those as well. And naturally we’ll explain the new licensing model.
Join HCL Ambassador Marc Thomas in this webinar with a special guest appearance from Franz Walder. It will give you the tools and know-how to stay on top of what is going on with Domino licensing. You will be able to lower your cost through an optimized configuration and keep it low going forward.
These topics will be covered
- Reducing license cost by finding and fixing misconfigurations and superfluous accounts
- How do CCB and CCX licenses really work?
- Understanding the DLAU tool and how to best utilize it
- Tips for common problem areas, like team mailboxes, functional/test users, etc
- Practical examples and best practices to implement right away
Introduction of Cybersecurity with OSS at Code Europe 2024 - Hiroshi SHIBATA
I develop the Ruby programming language, RubyGems, and Bundler, which are package managers for Ruby. Today, I will introduce how to enhance the security of your application using open-source software (OSS) examples from Ruby and RubyGems.
The first topic is CVE (Common Vulnerabilities and Exposures). I have published CVEs many times. But what exactly is a CVE? I'll provide a basic understanding of CVEs and explain how to detect and handle vulnerabilities in OSS.
Next, let's discuss package managers. Package managers play a critical role in the OSS ecosystem. I'll explain how to manage library dependencies in your application.
I'll share insights into how the Ruby and RubyGems core team works to keep our ecosystem safe. By the end of this talk, you'll have a better understanding of how to safeguard your code.
Building Production Ready Search Pipelines with Spark and Milvus - Zilliz
Spark is the widely used ETL tool for processing, indexing and ingesting data into a serving stack for search. Milvus is the production-ready open-source vector database. In this talk we will show how to use Spark to process unstructured data to extract vector representations, and push the vectors to the Milvus vector database for search serving.
Digital Marketing Trends in 2024 | Guide for Staying Ahead - Wask
https://www.wask.co/ebooks/digital-marketing-trends-in-2024
Feeling lost in the digital marketing whirlwind of 2024? Technology is changing, consumer habits are evolving, and staying ahead of the curve feels like a never-ending pursuit. This e-book is your compass. Dive into actionable insights to handle the complexities of modern marketing. From hyper-personalization to the power of user-generated content, learn how to build long-term relationships with your audience and unlock the secrets to success in the ever-shifting digital landscape.
A Comprehensive Guide to DeFi Development Services in 2024 - Intelisync
DeFi represents a paradigm shift in the financial industry. Instead of relying on traditional, centralized institutions like banks, DeFi leverages blockchain technology to create a decentralized network of financial services. This means that financial transactions can occur directly between parties, without intermediaries, using smart contracts on platforms like Ethereum.
In 2024, we are witnessing an explosion of new DeFi projects and protocols, each pushing the boundaries of what’s possible in finance.
In summary, DeFi in 2024 is not just a trend; it’s a revolution that democratizes finance, enhances security and transparency, and fosters continuous innovation. As we proceed through this presentation, we'll explore the various components and services of DeFi in detail, shedding light on how they are transforming the financial landscape.
At Intelisync, we specialize in providing comprehensive DeFi development services tailored to meet the unique needs of our clients. From smart contract development to dApp creation and security audits, we ensure that your DeFi project is built with innovation, security, and scalability in mind. Trust Intelisync to guide you through the intricate landscape of decentralized finance and unlock the full potential of blockchain technology.
Ready to take your DeFi project to the next level? Partner with Intelisync for expert DeFi development services today!
Main news related to the CCS TSI 2023 (2023/1695) - Jakub Marek
An English 🇬🇧 translation of the presentation accompanying the speech I gave about the main changes brought by CCS TSI 2023 at the biggest Czech conference on communications and signalling systems on railways, which was held in the Clarion Hotel Olomouc from 7th to 9th November 2023 (konferenceszt.cz). Attended by around 500 participants and 200 online followers.
The original Czech 🇨🇿 version of the presentation can be found here: https://www.slideshare.net/slideshow/hlavni-novinky-souvisejici-s-ccs-tsi-2023-2023-1695/269688092 .
The videorecording (in Czech) from the presentation is available here: https://youtu.be/WzjJWm4IyPk?si=SImb06tuXGb30BEH .
FREE A4 Cyber Security Awareness Posters - Social Engineering part 3 - Data Hops
Free A4 downloadable and printable cyber security and social engineering safety training posters. Promote security awareness in the home or workplace. Lock them out. From training providers at datahops.com.
leewayhertz.com - AI in predictive maintenance: Use cases, technologies, benefits ... - alexjohnson7307
Predictive maintenance is a proactive approach that anticipates equipment failures before they happen. At the forefront of this innovative strategy is Artificial Intelligence (AI), which brings unprecedented precision and efficiency. AI in predictive maintenance is transforming industries by reducing downtime, minimizing costs, and enhancing productivity.
66. Why can we never be sure till we die. Or have killed for an answer.
67. CAP – Consistency, Availability, Partition tolerance
68. CAP – the variations: CA – irrelevant; CP – eventually unavailable, offering maximum consistency; AP – eventually inconsistent, offering maximum availability
126. Hinted handoff (N: node, G: group including N). While node(N) is unavailable: replicate to G or store data(N) locally, and record a handoff hint for later. When node(N) is alive again: hand the data off to node(N).
127. (Diagram) The direct replica for Key = “foo” fails, so the write is replicated to another of the N replicas with handoff hint = true for later delivery.
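A toy C sketch of the hinted-handoff bookkeeping these slides describe; the fixed-size hint table, the stubbed failure detector and all names are mine, purely for illustration:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_HINTS 64

    typedef struct {
        const char *key, *value;
        int intended_node;            /* replica that was down at write time */
    } Hint;

    static Hint hints[MAX_HINTS];     /* local hint store */
    static int  n_hints = 0;

    /* Stub failure detector; real systems learn liveness via gossip. */
    static bool node_alive(int node) { return node != 2; }

    static void send_to(int node, const char *key, const char *value) {
        printf("send %s=%s to node %d\n", key, value, node);
    }

    /* Write path: if the direct replica is down, keep a hint locally. */
    static void replicate(int node, const char *key, const char *value) {
        if (node_alive(node))
            send_to(node, key, value);
        else if (n_hints < MAX_HINTS)
            hints[n_hints++] = (Hint){ key, value, node };
    }

    /* Run periodically: hand stored hints off to nodes that came back. */
    static void handoff(void) {
        for (int i = 0; i < n_hints; ) {
            if (node_alive(hints[i].intended_node)) {
                send_to(hints[i].intended_node, hints[i].key, hints[i].value);
                hints[i] = hints[--n_hints];   /* swap-remove delivered hint */
            } else {
                i++;
            }
        }
    }

    int main(void) {
        replicate(1, "foo", "bar");   /* node 1 is up: sent directly   */
        replicate(2, "foo", "bar");   /* node 2 is down: hint recorded */
        handoff();                    /* node 2 still down: hint kept  */
        printf("pending hints: %d\n", n_hints);
        return 0;
    }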
133. MapReduce. Model: functional map/fold. Out-database MR: irrelevant. In-database MR: data locality, no splitting needed, distributed querying, distributed processing.
134. (Diagram) In-database MapReduce: a query for N = "Alice" is mapped on nodes A, B and C, where the matching data lives; node X reduces the per-node hit lists into the final result.
145. Many graphics I’ve created myself, though I’d better have asked @mononcqc for help ‘cause his drawings are awesome. Some images originate from istockphoto.com, except a few taken from Wikipedia and product pages.