Course "Machine Learning and Data Mining" for the degree in Computer Engineering at the Politecnico di Milano. In this lecture, we give an overview of the mining of data streams.
Slides for my Associate Professor (oavlönad docent, i.e., unsalaried docent) lecture.
The lecture is about Data Streaming (its evolution and basic concepts) and also contains an overview of my research.
With the expansion of big data and analytics, organizations are looking to incorporate data streaming into their business processes to make real-time decisions.
Join this webinar as we guide you through the buzz around data streams:
- Market trends in stream processing
- What is stream processing
- How does stream processing compare to traditional batch processing
- High and low volume streams
- The possibilities of working with data streaming and the benefits it provides to organizations
- The importance of spatial data in streams
The right architecture is key for any IT project. This is especially true for big data projects, where there are no standard architectures that have proven their suitability over the years. This session discusses the different big data architectures that have evolved over time, including the traditional Big Data Architecture, the Streaming Analytics architecture, and the Lambda and Kappa architectures, and presents the mapping of components from both the open-source and the Oracle stack onto these architectures.
Simplify and Scale Data Engineering Pipelines with Delta Lake (Databricks)
We’re always told to ‘Go for the Gold!’, but how do we get there? This talk will walk you through the process of moving your data to the finish line to get that gold medal! A common data engineering pipeline architecture uses tables that correspond to different quality levels, progressively adding structure to the data: data ingestion (‘Bronze’ tables), transformation/feature engineering (‘Silver’ tables), and machine learning training or prediction (‘Gold’ tables). Combined, we refer to these tables as a ‘multi-hop’ architecture. It allows data engineers to build a pipeline that begins with raw data as a ‘single source of truth’ from which everything flows. In this session, we will show how to build a scalable data engineering pipeline using Delta Lake, so you can be the champion in your organization.
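The Bronze/Silver/Gold hops can be sketched in plain Python. This is a minimal illustration of the idea only: the dicts and the `to_silver`/`to_gold` helpers are hypothetical stand-ins, where a real pipeline would read and write Delta Lake tables.

```python
# Illustrative sketch of a Bronze/Silver/Gold "multi-hop" pipeline.
# Plain Python stands in for Delta Lake tables here; in a real pipeline
# each hop would be a Delta table written by a Spark job.

raw_events = [  # Bronze: raw ingested records, kept as-is
    {"user": "a", "amount": "10.5", "ts": "2024-01-01"},
    {"user": "b", "amount": "bad", "ts": "2024-01-01"},
    {"user": "a", "amount": "4.5", "ts": "2024-01-02"},
]

def to_silver(bronze):
    # Silver: cleaned and typed; rows that fail validation are skipped
    silver = []
    for row in bronze:
        try:
            silver.append({"user": row["user"], "amount": float(row["amount"])})
        except ValueError:
            continue  # a real pipeline might quarantine malformed records
    return silver

def to_gold(silver):
    # Gold: aggregated, ready for ML training or BI consumption
    totals = {}
    for row in silver:
        totals[row["user"]] = totals.get(row["user"], 0.0) + row["amount"]
    return totals

gold = to_gold(to_silver(raw_events))
print(gold)  # {'a': 15.0}
```

Each hop consumes only the previous table, so the raw Bronze data remains the single source of truth from which everything downstream can be rebuilt.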
HDFS is a Java-based file system that provides scalable and reliable data storage, and it was designed to span large clusters of commodity servers. HDFS has demonstrated production scalability of up to 200 PB of storage and a single cluster of 4500 servers, supporting close to a billion files and blocks.
Presentation for the Softskills Seminar course @ Telecom ParisTech. The topic is the paper by Domingos and Hulten, "Mining High-Speed Data Streams". Presented by me on 30/11/2017.
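The core of that paper is the Hoeffding bound, which lets the VFDT algorithm decide from a finite sample when it is safe to split a decision-tree node. A minimal sketch of the bound and the split test (the gain values below are made-up illustration numbers):

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding bound as used by VFDT: with probability 1 - delta, the
    observed mean of n samples lies within epsilon of the true mean."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Split decision: if the gap between the best and second-best attribute's
# information gain exceeds epsilon, split with confidence 1 - delta.
gain_best, gain_second = 0.42, 0.30   # hypothetical observed gains
n = 1000                              # examples seen at this leaf
eps = hoeffding_bound(value_range=1.0, delta=1e-6, n=n)
print(eps, gain_best - gain_second > eps)
```

Because epsilon shrinks as `1/sqrt(n)`, the tree needs only a bounded number of examples per split, which is what makes single-pass, high-speed stream mining feasible.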
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system.
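The map/sort/reduce flow described above can be simulated in a few lines of Python (a toy single-process word count; a real job would run the map tasks in parallel over HDFS splits):

```python
from itertools import groupby

def map_phase(chunk):
    # Map: emit (word, 1) pairs for every word in one input split
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    # Framework step: sort map output so equal keys become adjacent,
    # then group the values for each key (the "shuffle and sort" phase)
    mapped = sorted(mapped, key=lambda kv: kv[0])
    return {k: [v for _, v in grp] for k, grp in groupby(mapped, key=lambda kv: kv[0])}

def reduce_phase(key, values):
    # Reduce: aggregate all values that share a key
    return key, sum(values)

chunks = ["the quick fox", "the lazy dog", "the fox"]          # independent input splits
mapped = [kv for chunk in chunks for kv in map_phase(chunk)]   # parallel in a real job
result = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
print(result)  # {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

The key property is that each map task sees only its own chunk, and each reduce call sees all values for one key, which is exactly what makes the model trivially parallel.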
This is the complete information about data replication you need; I focus on these topics:
What is replication?
Who uses it?
Types of replication
Implementation methods
A simple presentation about different big data stream processing systems, such as Spark, Samza, and Storm, and the differences between their architectures and purposes; it also covers streaming-layer tools such as Kafka and RabbitMQ. The presentation refers to this paper:
https://vsis-www.informatik.uni-hamburg.de/getDoc.php/publications/561/Real-time%20stream%20processing%20for%20Big%20Data.pdf and other useful links.
Uses the example of correct, high-throughput grouping and counting of streaming events as a backdrop for exploring the state-of-the-art features of Apache Flink.
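Grouping and counting streaming events typically means keyed tumbling windows. A minimal plain-Python sketch of that computation (the event tuples are invented examples; in Flink itself this would be a `key_by(...).window(...)` pipeline on a DataStream):

```python
from collections import Counter

def tumbling_window_counts(events, window_size):
    """Group (timestamp, key) events into fixed-size tumbling windows
    and count occurrences per key within each window."""
    windows = {}
    for ts, key in events:
        start = (ts // window_size) * window_size  # window this event falls into
        windows.setdefault(start, Counter())[key] += 1
    return windows

events = [(1, "a"), (3, "b"), (4, "a"), (11, "a"), (12, "b")]
print(tumbling_window_counts(events, window_size=10))
```

What a real stream processor adds on top of this batch sketch is the hard part: event-time watermarks for late data, fault-tolerant window state, and exactly-once results.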
This presentation about Hadoop YARN will help you understand Hadoop 1.0 and Hadoop 2.0, the limitations of Hadoop 1.0, the need for YARN, what YARN is, the workloads running on YARN, YARN components, and YARN architecture, and you will also go through a demo on YARN. YARN is the cluster resource management layer of the Apache Hadoop ecosystem, which schedules jobs and assigns resources. Hadoop 1.0 was designed to run only MapReduce jobs and had issues with scalability, resource utilization, etc., whereas YARN solved those issues and lets users work with multiple processing models. Now let us get started and learn YARN in detail.
Below topics are explained in this Hadoop YARN presentation:
1. Hadoop 1.0 (MapReduce 1)
2. Limitations of Hadoop 1.0 (MapReduce 1)
3. Need for YARN
4. What is YARN
5. Workloads running on YARN
6. YARN components
7. YARN architecture
8. Demo on YARN
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, YARN, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro schema, using Avro with Hive and Sqoop, and schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDDs) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
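The functional, RDD-style programming mentioned in the objectives above can be sketched without a cluster. The plain-Python word count below mirrors a classic Spark chain (the Spark expression in the comment is the assumed equivalent, not code from the course):

```python
# Plain-Python stand-in for the Spark RDD chain:
#   sc.textFile(...).flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(add)
lines = ["spark makes big data simple", "big data needs spark"]

# flatMap + map: one (word, 1) pair per word across all lines
pairs = [(w, 1) for line in lines for w in line.split()]

def reduce_by_key(pairs):
    # reduceByKey: combine all values that share a key
    out = {}
    for k, v in pairs:
        out[k] = out.get(k, 0) + v
    return out

counts = reduce_by_key(pairs)
print(counts["spark"], counts["big"])  # 2 2
```

In Spark, the same transformations are lazy and distributed: nothing runs until an action forces evaluation, and `reduceByKey` combines values locally on each partition before shuffling.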
At Instagram, our mission is to capture and share the world's moments. Our app is used by over 400M people monthly; this creates a lot of challenging data needs. We use Cassandra heavily, as a general key-value storage. In this presentation, I will talk about how we use Cassandra to serve our critical use cases; the improvements/patches we made to make sure Cassandra can meet our low latency, high scalability requirements; and some pain points we have.
About the Speaker
Dikang Gu Software Engineer, Facebook
I'm a software engineer on the Instagram core infra team, working on scaling Instagram's infrastructure, especially on building a generic key-value store based on Cassandra. Prior to this, I worked on the development of HDFS at Facebook. I received my master's degree in Computer Science from Shanghai Jiao Tong University in China.
These days fast code needs to operate in harmony with its environment. At the deepest level this means working well with hardware: RAM, disks, and SSDs. A unifying theme is treating memory access patterns in a uniform and predictable way that is sympathetic to the underlying hardware. For example, writing to and reading from RAM and hard disks can be significantly sped up by operating sequentially on the device, rather than randomly accessing the data.
In this talk we’ll cover why access patterns are important, what kind of speed gain you can get, and how you can write simple high-level code which works well with these kinds of patterns.
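The sequential-versus-random effect is easy to observe yourself. A rough sketch: the snippet below sums the same array twice, once in order and once in shuffled order, so any timing gap comes purely from the access pattern. (The effect is far more dramatic in C or on spinning disks; Python's interpreter overhead dampens it, so timings here are only indicative.)

```python
import random
import time

N = 1_000_000
data = list(range(N))
idx_seq = list(range(N))    # sequential access order
idx_rand = idx_seq[:]
random.shuffle(idx_rand)    # random access order over the same indices

def total(indices):
    s = 0
    for i in indices:
        s += data[i]
    return s

t0 = time.perf_counter(); s_seq = total(idx_seq); t_seq = time.perf_counter() - t0
t0 = time.perf_counter(); s_rand = total(idx_rand); t_rand = time.perf_counter() - t0
assert s_seq == s_rand      # identical work, only the access order differs
print(f"sequential: {t_seq:.3f}s  random: {t_rand:.3f}s")
```

Sequential access lets CPU prefetchers and cache lines do their job; random access defeats them, which is exactly the "mechanical sympathy" the talk is about.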
Invited talk at Tsinghua University on "Applications of Deep Neural Network". As the tech. lead of deep learning task force at NIO USA INC, I was invited to give this colloquium talk on general applications of deep neural network.
Network-aware Data Management for High Throughput Flows
As current technology enables faster storage devices and larger interconnect bandwidth, there is a substantial need for novel system design and middleware architecture to address increasing latency, scalability, and throughput requirements. In this talk, I will outline network-aware data management and present solutions based on my past experience in large-scale data migration between remote repositories.
I will first describe my experience in the initial evaluation of 100Gbps network as a part of the Advance Network Initiative project. We needed intense fine-tuning in network, storage, and application layers, to take advantage of the higher network capacity. I will introduce a special data movement prototype, successfully tested in one of the first 100Gbps demonstrations, in which applications map memory blocks for remote data, in contrast to the send/receive semantics.
Within this scope, I will introduce a flexible network reservation algorithm for on-demand, bandwidth-guaranteed virtual circuit services. Flexible reservations find the best path in a time-dependent dynamic network topology to support predictable application performance. I will then present a data-scheduling model with advance provisioning, in which data movement operations are defined with earliest start and latest completion times.
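The earliest-start/latest-completion model can be sketched as a simple feasibility check. This is my own single-link simplification for illustration; the algorithm described in the talk searches paths in a time-dependent topology rather than checking one fixed link:

```python
def reserve(volume_gb, bandwidth_gbps, earliest_start, latest_completion):
    """Sketch of advance-provisioned scheduling: a transfer of volume_gb
    must fit between its earliest start and latest completion time.
    Returns the (start, end) of the reservation, or None if infeasible.
    Times are in seconds; single fixed-bandwidth link assumed."""
    duration = volume_gb * 8 / bandwidth_gbps  # seconds to move the data
    if earliest_start + duration > latest_completion:
        return None  # the deadline cannot be met at this bandwidth
    return (earliest_start, earliest_start + duration)

print(reserve(volume_gb=100, bandwidth_gbps=10, earliest_start=0, latest_completion=120))
print(reserve(volume_gb=100, bandwidth_gbps=10, earliest_start=0, latest_completion=60))
```

A real scheduler would additionally slide the start time within the window and trade bandwidth against duration across competing reservations.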
I will conclude my talk with a very brief overview of my other related projects on performance engineering, hyper-converged virtual storage, and optimization in control and data path for virtualized environments.
Sept 28, 2015
Akamai, Cambridge, MA
Reflections on Almost Two Decades of Research into Stream Processing (Kyumars Sheykh Esmaili)
This is the slide deck that I used during my tutorial presentation at the ACM DEBS Conference (http://www.debs2017.org/) that was held in Barcelona between June 19 and June 23, 2017.
The tutorial paper itself can be accessed here: http://dl.acm.org/citation.cfm?id=3095110
Proactive Data Containers (PDC): An Object-centric Data Store for Large-scale... (Globus)
These slides were presented by Suren Byna from Lawrence Berkeley National Lab (LBNL) at the AGU Fall Meeting 2018 in a session titled "Scalable Data Management Practices in Earth Sciences" convened by Ian Foster, Globus co-founder and director of Argonne's data science and learning division.
The Seven Main Challenges of an Early Warning System Architecture (streamspotter)
J. Moßgraber, F. Chaves, S. Middleton, Z. Zlatev, and R. Tao on "The Seven Main Challenges of an Early Warning System Architecture" at ISCRAM 2013 in Baden-Baden.
10th International Conference on Information Systems for Crisis Response and Management
12-15 May 2013, Baden-Baden, Germany
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
DMTM Lecture 13: Representative-based Clustering (Pier Luca Lanzi)
Slides for the 2016/2017 edition of the Data Mining and Text Mining Course at the Politecnico di Milano. The course is also part of the joint program with the University of Illinois at Chicago.
SAP Sapphire 2024 - ASUG301: Building Better Apps with SAP Fiori (Peter Spielvogel)
Building better applications for business users with SAP Fiori.
• What is SAP Fiori and why it matters to you
• How a better user experience drives measurable business benefits
• How to get started with SAP Fiori today
• How SAP Fiori elements accelerates application development
• How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
• How SAP Fiori paves the way for using AI in SAP apps
Smart TV Buyer Insights Survey 2024 (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Observability Concepts EVERY Developer Should Know – DeveloperWeek Europe (Paige Cruz)
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble, many organizations still relegate monitoring and observability to the purview of ops, infra, and SRE teams. This is a mistake: achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party, and I will share these foundational concepts to build on.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™UiPathCommunity
In questo evento online gratuito, organizzato dalla Community Italiana di UiPath, potrai esplorare le nuove funzionalità di Autopilot, il tool che integra l'Intelligenza Artificiale nei processi di sviluppo e utilizzo delle Automazioni.
📕 Vedremo insieme alcuni esempi dell'utilizzo di Autopilot in diversi tool della Suite UiPath:
Autopilot per Studio Web
Autopilot per Studio
Autopilot per Apps
Clipboard AI
GenAI applicata alla Document Understanding
👨🏫👨💻 Speakers:
Stefano Negro, UiPath MVPx3, RPA Tech Lead @ BSP Consultant
Flavio Martinelli, UiPath MVP 2023, Technical Account Manager @UiPath
Andrei Tasca, RPA Solutions Team Lead @NTT Data
Removing Uninteresting Bytes in Software FuzzingAftab Hussain
Imagine a world where software fuzzing, the process of mutating bytes in test seeds to uncover hidden and erroneous program behaviors, becomes faster and more effective. A lot depends on the initial seeds, which can significantly dictate the trajectory of a fuzzing campaign, particularly in terms of how long it takes to uncover interesting behaviour in your code. We introduce DIAR, a technique designed to speedup fuzzing campaigns by pinpointing and eliminating those uninteresting bytes in the seeds. Picture this: instead of wasting valuable resources on meaningless mutations in large, bloated seeds, DIAR removes the unnecessary bytes, streamlining the entire process.
In this work, we equipped AFL, a popular fuzzer, with DIAR and examined two critical Linux libraries -- Libxml's xmllint, a tool for parsing xml documents, and Binutil's readelf, an essential debugging and security analysis command-line tool used to display detailed information about ELF (Executable and Linkable Format). Our preliminary results show that AFL+DIAR does not only discover new paths more quickly but also achieves higher coverage overall. This work thus showcases how starting with lean and optimized seeds can lead to faster, more comprehensive fuzzing campaigns -- and DIAR helps you find such seeds.
- These are slides of the talk given at IEEE International Conference on Software Testing Verification and Validation Workshop, ICSTW 2022.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Enhancing Performance with Globus and the Science DMZGlobus
ESnet has led the way in helping national facilities—and many other institutions in the research community—configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
2. References
Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann Series in Data Management Systems (Second Edition), Chapter 8, part 1
Prof. Pier Luca Lanzi
3. Mining Data Streams
What is stream data?
Why Stream Data Systems?
Stream data management systems: Issues and solutions
Stream data cube and multidimensional OLAP analysis
Stream frequent pattern analysis
Stream classification
Stream cluster analysis
Research issues
4. Characteristics of Data Streams
Data Streams vs DBMS
Data streams
continuous, ordered, changing, fast, huge amount
Traditional DBMS
data stored in finite, persistent data sets
Characteristics
Huge volumes of continuous data, possibly infinite
Fast changing and requires fast, real-time response
Data streams nicely capture today's data processing needs
Random access is expensive
single scan algorithm (can only have one look)
Store only the summary of the data seen thus far
Most stream data are rather low-level or multi-dimensional in
nature and need multi-level, multi-dimensional processing
5. What are the Applications?
Telecommunication calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering & industrial processes: power supply &
manufacturing
Sensor, monitoring & surveillance: video streams, RFIDs
Security monitoring
Web logs and Web page click streams
Massive data sets (even if saved, random access is too expensive)
6. DBMS versus DSMS
Persistent relations vs. transient streams
One-time queries vs. continuous queries
Random access vs. sequential access
"Unbounded" disk store vs. bounded main memory
Only the current state matters vs. historical data is important
No real-time services vs. real-time requirements
Relatively low update rate vs. possibly multi-GB arrival rate
Data at any granularity vs. data at fine granularity
Precise data assumed vs. stale/imprecise data
Access plan determined by the query processor and physical DB design vs. unpredictable, variable data arrival and characteristics
Ack. From Motwani's PODS tutorial slides
7. The Architecture of Stream Query Processing
[Figure: multiple streams enter an SDMS (Stream Data Management System); a stream query processor evaluates users'/applications' continuous queries over them, returning continuous query results and using a scratch space held in main memory and/or on disk.]
8. What are the Challenges of Stream Data Processing?
Multiple, continuous, rapid, time-varying, ordered streams
Main memory computations
Queries are often continuous
Evaluated continuously as stream data arrives
Answer updated over time
Queries are often complex
Beyond element-at-a-time processing
Beyond stream-at-a-time processing
Beyond relational queries (scientific, data mining, OLAP)
Multi-level/multi-dimensional processing and data mining
Most stream data are at low-level or multi-dimensional in
nature
9. Processing Stream Queries
Query types
One-time query vs. continuous query (being evaluated
continuously as stream continues to arrive)
Predefined query vs. ad-hoc query (issued on-line)
Unbounded memory requirements
For real-time response, main memory algorithm should be used
Memory requirement is unbounded if one will join future tuples
Approximate query answering
With bounded memory, it is not always possible
to produce exact answers
High-quality approximate answers are desired
Data reduction and synopsis construction methods:
Sketches, random sampling, histograms, wavelets, etc.
10. What are the Methodologies for Stream Data Processing?
Major challenges
Keep track of a large universe,
e.g., pairs of IP addresses, not ages
Methodology
Synopses (trade-off between accuracy and storage)
Use synopsis data structures, much smaller (O(log^k N) space)
than their base data set (O(N) space)
Compute an approximate answer within a small error range
(factor ε of the actual answer)
Major methods
Random sampling
Histograms
Sliding windows
Multi-resolution model
Sketches
Randomized algorithms
11. Stream Data Processing Methods (1)
Random sampling (but without knowing the total length in advance)
Reservoir sampling: maintain a set of s candidates in the reservoir,
which form a true random sample of the elements seen so far in the
stream. As the data stream flows, every new element has a certain
probability (s/N) of replacing an old element in the reservoir.
Sliding windows
Make decisions based only on recent data of sliding window size w
An element arriving at time t expires at time t + w
Histograms
Approximate the frequency distribution of element values in a stream
Partition data into a set of contiguous buckets
Equal-width (equal value range for buckets) vs. V-optimal (minimizing
frequency variance within each bucket)
Multi-resolution models
Popular models: balanced binary trees, micro-clusters, and wavelets
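The reservoir sampling step above fits in a few lines of Python; the function name and the seeded generator are illustrative choices, not part of the original description:

```python
import random

def reservoir_sample(stream, s, rng=random.Random(42)):
    """Maintain a uniform random sample of size s over a stream of
    unknown length, in one pass and O(s) memory."""
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= s:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(n)        # new item replaces an old one
            if j < s:                   # ...with probability s/n
                reservoir[j] = item
    return reservoir
```

After processing N elements, every element has probability s/N of being in the reservoir.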
12. Stream Data Processing Methods (2)
Sketches
Histograms and wavelets require multiple passes over the
data, but sketches can operate in a single pass
Frequency moments of a stream A = {a1, …, aN}:

F_k = Σ_{i=1}^{v} (m_i)^k

where v is the size of the universe (domain) and m_i is the frequency of value i in the sequence
Given N elements and v values, sketches can approximate F0, F1, F2 in O(log v + log N) space
13. Stream Data Processing Methods (3)
Randomized algorithms
Monte Carlo algorithms: bounded running time, but may not always return the correct result
Chebyshev's inequality: let X be a random variable with mean μ and standard deviation σ; then for any k > 0

P(|X − μ| > k) ≤ σ² / k²

Chernoff bound: let X be the sum of independent Poisson trials X1, …, Xn with mean μ, and δ in (0, 1]; the probability decreases exponentially as we move away from the mean:

P[X < (1 − δ)μ] < e^(−μδ²/4)
14. Approximate Query Answering in Streams
Sliding windows
Only over sliding windows of recent stream data
Approximation but often more desirable in applications
Batched processing, sampling and synopses
Batched if update is fast but computing is slow
• Compute periodically, not very timely
Sampling if update is slow but computing is fast
• Compute using sample data, but not good for joins, etc.
Synopsis data structures
• Maintain a small synopsis or sketch of data
• Good for querying historical data
Blocking operators, e.g., sorting, avg, min, etc.
Blocking if unable to produce the first output until seeing the
entire input
15. Projects on DSMS (Data Stream Management System)
Research projects and system prototypes
STREAM (Stanford): A general-purpose DSMS
Cougar (Cornell): sensors
Aurora (Brown/MIT): sensor monitoring, dataflow
Hancock (AT&T): telecom streams
Niagara (OGI/Wisconsin): Internet XML databases
OpenCQ (Georgia Tech): triggers, incr. view maintenance
Tapestry (Xerox): pub/sub content-based filtering
Telegraph (Berkeley): adaptive engine for sensors
Tradebot (www.tradebot.com): stock tickers & streams
Tribeca (Bellcore): network monitoring
MAIDS (UIUC/NCSA): Mining Alarming Incidents in Data
Streams
16. Stream Data Mining vs. Stream Querying
Stream mining—A more challenging task in many cases
It shares most of the difficulties with stream querying
But often requires less “precision”,
e.g., no join, grouping, sorting
Patterns are hidden and more general than querying
It may require exploratory analysis,
not necessarily continuous queries
Stream data mining tasks
Multi-dimensional on-line analysis of streams
Mining outliers and unusual patterns in stream data
Clustering data streams
Classification of stream data
17. Challenges for Mining Dynamics in Data Streams
Most stream data are rather low-level or multi-dimensional in
nature and need multi-level/multi-dimensional (ML/MD) processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multi-dimensions/levels
Fast, real-time detection and response
Comparing with data cube: Similarity and differences
Stream (data) cube or stream OLAP: Is this feasible?
Can we implement it efficiently?
18. Multi-Dimensional Stream Analysis: Examples
Analysis of Web click streams
Raw data at low levels: seconds, web page addresses,
user IP addresses, …
Analysts want: changes, trends, unusual patterns, at
reasonable levels of details
E.g., "the average click traffic in North America on sports in
the last 15 minutes is 40% higher than that in the last 24
hours"
Analysis of power consumption streams
Raw data: power consumption flow for every household,
every minute
Patterns one may find: the average hourly power
consumption of manufacturing companies in Chicago in the
last 2 hours is 30% higher than on the same day a week ago
19. A Stream Cube Architecture
A tilted time frame
Different time granularities:
second, minute, quarter, hour, day, week, …
Critical layers
Minimum interest layer (m-layer)
Observation layer (o-layer)
User: watches at the o-layer and occasionally drills
down to the m-layer
Partial materialization of stream cubes
Full materialization: too space and time consuming
No materialization: slow response at query time
Partial materialization: what do we mean “partial”?
20. A Tilted Time Model (1)
Natural tilted time frame:
Example: minimal granularity of a quarter (15 minutes); then 4 quarters → 1 hour, 24 hours → 1 day, 31 days → 1 month, 12 months → 1 year, …
[Figure: a time axis tilted into 4 qtrs, 24 hours, 31 days, 12 months]
Logarithmic tilted time frame:
Example: minimal granularity of 1 minute; then windows of 1, 2, 4, 8, 16, 32, … units
[Figure: a time axis tilted into windows t, t, 2t, 4t, 8t, 16t, 32t, 64t]
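The logarithmic frame can be maintained like a binary counter over per-unit aggregates; the merge-at-three policy below is one illustrative way (an assumption, not the only variant) to keep at most two aggregates per granularity:

```python
def tilted_insert(levels, count):
    """Insert the newest time unit's aggregate (e.g., a count) into a
    logarithmic tilted-time frame. levels[i] holds aggregates that each
    cover 2**i units, newest first; whenever a level holds 3 aggregates,
    its two oldest are merged into the next coarser level, so only
    O(log N) aggregates remain for N units."""
    if not levels:
        levels.append([])
    levels[0].insert(0, count)                       # newest first
    i = 0
    while len(levels[i]) == 3:
        merged = levels[i].pop() + levels[i].pop()   # the two oldest
        if i + 1 == len(levels):
            levels.append([])
        levels[i + 1].insert(0, merged)
        i += 1
    return levels
```

The total of all stored aggregates always equals the total over the raw units, so nothing is lost, only coarsened.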
21. A Tilted Time Model (2)
Pyramidal tilted time frame
Example: suppose there are frames numbered 0 through 5 and each holds at most 3 snapshots
Given a snapshot number N, insert it into frame number d for the largest d with N mod 2^d = 0; if the frame then holds more than 3 snapshots, "kick out" the oldest one
Frame no. Snapshots (by clock time)
0 69 67 65
1 70 66 62
2 68 60 52
3 56 40 24
4 48 16
5 64 32
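The insertion rule above can be sketched directly; running it for snapshots 1 through 70 reproduces the table (function and parameter names are illustrative):

```python
def insert_snapshot(frames, n, num_frames=6, keep=3):
    """Pyramidal tilted-time frame: snapshot n goes into the highest
    frame d (d < num_frames) such that 2**d divides n; each frame
    keeps only the `keep` most recent snapshots."""
    d = 0
    while n % (2 ** (d + 1)) == 0 and d + 1 < num_frames:
        d += 1
    frames.setdefault(d, [])
    frames[d].insert(0, n)        # newest first
    if len(frames[d]) > keep:
        frames[d].pop()           # kick out the oldest snapshot
    return frames
```

Older snapshots survive only in high-numbered frames, so recent time is covered densely and the distant past sparsely.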
22. Two Critical Layers in the Stream Cube
o-layer (observation): (*, theme, quarter)
m-layer (minimal interest): (user-group, URL-group, minute)
(primitive) stream data layer: (individual-user, URL, second)
23. On-Line Partial Materialization vs. OLAP Processing
On-line materialization
Materialization takes precious space and time
• Only incremental materialization (with tilted time frame)
Only materialize “cuboids” of the critical layers?
• Online computation may take too much time
Preferred solution:
• popular-path approach: Materializing those along the popular
drilling paths
• H-tree structure: Such cuboids can be computed and stored
efficiently using the H-tree structure
Online aggregation vs. query-based computation
Online computing while streaming:
aggregating stream cubes
Query-based computation:
using computed cuboids
25. An H-Tree Cubing Structure
[Figure: an H-tree rooted at the observation layer, branching by city (Chicago, Urbana, Springfield), then by URL category (.edu, .gov, .com), then by topic (Elec, Chem, Bio) down to the minimal interest layer; leaf entries hold hourly counts, e.g., 6:00AM-7:00AM: 156, 7:00AM-8:00AM: 201, 8:00AM-9:00AM: 235, …]
26. Benefits of H-Tree and H-Cubing
H-tree and H-cubing
Developed for computing data cubes and iceberg cubes
Fast cubing, space preserving in cube computation
Using H-tree for stream cubing
Space preserving: intermediate aggregates can be computed
incrementally and saved in tree nodes
Facilitate computing other cells and multi-dimensional analysis
H-tree with computed cells can be viewed as stream cube
27. Frequent Patterns for Stream Data
Frequent pattern mining is valuable in stream applications
e.g., network intrusion mining
Mining precise frequent patterns in stream data is unrealistic,
even if they are stored in a compressed form such as an FP-tree
How to mine frequent patterns with good approximation?
Approximate frequent patterns (Manku & Motwani VLDB’02)
Keep only current frequent patterns? No changes can be
detected
Mining evolution freq. patterns
(C. Giannella, J. Han, X. Yan, P.S. Yu, 2003)
Use tilted time window frame
Mining evolution and dramatic changes of frequent patterns
Space-saving computation of frequent and top-k elements
(Metwally, Agrawal, and El Abbadi, ICDT'05)
28. Mining Approximate Frequent Patterns
Mining precise frequent patterns in stream data is unrealistic,
even if they are stored in a compressed form such as an FP-tree
Approximate answers are often sufficient
(e.g., trend/pattern analysis)
Example: a router is interested in all flows:
whose frequency is at least 1% (σ) of the entire traffic stream
seen so far
and feels that 1/10 of σ (ε = 0.1%) error is comfortable
How to mine frequent patterns with good approximation?
Lossy Counting Algorithm (Manku & Motwani, VLDB’02)
Major idea: do not trace an item until it may become frequent
Advantage: guaranteed error bound
Disadvantage: keeps a large set of traces
29. Lossy Counting for Frequent Items
Divide the stream into buckets of size 1/ε = 1000 (Bucket 1, Bucket 2, Bucket 3, …)
30. First Bucket of Stream
[Figure: the items of the first bucket are added to an initially empty summary of (item, count) pairs.]
At the bucket boundary, decrease all counters by 1
31. Next Bucket of Stream
[Figure: the next bucket of the stream is merged into the existing summary.]
At the bucket boundary, decrease all counters by 1
32. Approximation Guarantee
Given
(1) support threshold: σ,
(2) error threshold: ε, and
(3) stream length N
Output: items with frequency counts exceeding (σ – ε) N
How much do we undercount?
If (stream length seen so far = N)
and (bucket-size = 1/ε)
then frequency count error ≤ #buckets = εN
Approximation guarantee
No false negatives
False positives have true frequency count at least (σ–ε)N
Frequency count underestimated by at most εN
The space requirement is O((1/ε) log(εN)) entries
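The bookkeeping of the last slides can be sketched for single items; variable names are illustrative, and the per-item Δ records the maximum possible undercount:

```python
def lossy_count(stream, epsilon):
    """Lossy Counting for frequent items: bucket width 1/epsilon; at
    each bucket boundary, entries whose count plus maximum undercount
    fall below the bucket id are pruned, so any item's true frequency
    is underestimated by at most epsilon * N."""
    width = int(1 / epsilon)
    counts, delta = {}, {}
    bucket = 1
    for n, item in enumerate(stream, start=1):
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
            delta[item] = bucket - 1    # it may have been seen before
        if n % width == 0:              # bucket boundary: prune
            for x in list(counts):
                if counts[x] + delta[x] <= bucket:
                    del counts[x], delta[x]
            bucket += 1
    return counts, delta
```

To answer a query, report the items whose estimated count exceeds (σ − ε)N; by the guarantee above there are no false negatives.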
33. Lossy Counting For Frequent Itemsets
Divide the stream into buckets as for frequent items
But load as many buckets as possible into main memory at one time
(Bucket 1, Bucket 2, Bucket 3, …)
If we put 3 buckets of data into main memory at one time,
then we decrease each frequency count by 3
34. Update of Summary Data Structure
[Figure: itemset counts from 3 in-memory buckets are added to the summary data structure, and each count is then decreased by 3; an itemset whose count drops to 0 is deleted.]
That is why we choose a large number of buckets: more itemsets get deleted
35. Pruning Itemsets: the Apriori Rule
[Figure: counts from 3 in-memory buckets are merged into the summary data structure.]
If we find that an itemset is not frequent,
then we need not consider its supersets
36. Summary of Lossy Counting
Strength
A simple idea
Can be extended to frequent itemsets
Weakness:
The space bound is not tight
For frequent itemsets, the algorithm scans each record many times
The output is based on all previous data. But sometimes, we are
only interested in recent data
A space-saving method for stream frequent item mining
Metwally, Agrawal and El Abbadi, ICDT'05
37. Mining Evolution of Frequent Patterns for Stream Data
Approximate frequent patterns (Manku & Motwani VLDB’02)
Keep only current frequent patterns:
No changes can be detected
Mining evolution and dramatic changes of frequent patterns
(Giannella, Han, Yan, Yu, 2003)
Use tilted time window frame
Use compressed form to store significant (approximate)
frequent patterns and their time-dependent traces
Note: To mine precise counts, one has to trace/keep a fixed
(and small) set of items
38. Two Structures for Mining Frequent Patterns with Tilted-Time Window (1)
FP-Trees store Frequent
Patterns, rather than Transactions
Tilted-time major: An FP-tree for
each tilted time frame
39. Two Structures for Mining Frequent Patterns with Tilted-Time Window (2)
The second data structure:
Observation: FP-Trees of different time units are similar
Pattern-tree major: each node is associated with a tilted-time
window
40. Classification in Data Streams
What are the issues?
It is impossible to store the whole data set,
as traditional classification algorithms require
It is usually not possible to perform multiple scans
of the input data
Data streams are time-varying!
There is concept drift.
41. Classification for Dynamic Data Streams
Decision tree induction for stream data classification
VFDT (Very Fast Decision Tree)/CVFDT
Is decision-tree good for modeling fast changing data, e.g.,
stock market analysis?
Other stream classification methods
Instead of decision-trees, consider other models:
Naïve Bayesian,
Ensemble (Wang, Fan, Yu, Han. KDD’03)
K-nearest neighbors (Aggarwal, Han, Wang, Yu. KDD’04)
Tilted time framework, incremental updating, dynamic
maintenance, and model construction
Comparison of models to find changes
42. Hoeffding Tree
Initially introduced to analyze click-streams
With high probability, it classifies tuples the same way a tree trained on all the data would
Only uses a small sample
Based on Hoeffding Bound principle
Hoeffding Bound (Additive Chernoff Bound)
r: random variable representing the attribute selection method
R: range of r
n: # independent observations
The true mean of r is at least r_avg − ε with probability 1 − δ, where

ε = sqrt( R² ln(1/δ) / (2n) )
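The bound is easy to evaluate; for information gain over two classes the range R is 1, and ε shrinks as 1/√n (the function name is an illustrative choice):

```python
import math

def hoeffding_epsilon(R, n, delta):
    """Hoeffding bound: after n independent observations of a random
    variable with range R, the observed mean is within epsilon of the
    true mean with probability 1 - delta."""
    return math.sqrt(R ** 2 * math.log(1 / delta) / (2 * n))
```

Quadrupling the number of observations halves ε, which is why a modest sample usually suffices to pick a split attribute with high confidence.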
43. Hoeffding Tree Algorithm
The algorithm uses the bound to determine, with high
probability, the smallest number N of examples needed at a
node before selecting the splitting attribute
44. Hoeffding Tree Algorithm
Hoeffding Tree Input
S: sequence of examples
X: attributes
G( ): evaluation function
δ: desired accuracy
for each example in S
retrieve G(Xa) and G(Xb) //two highest G(Xi)
if ( G(Xa) – G(Xb) > ε )
split on Xa
recurse to next node
break
Complexity is O(l·d·v·c) where l is the maximum depth, d the number of
attributes, v the maximum number of values per attribute, and c the
number of classes
45. Decision-Tree Induction with Data Streams
[Figure: a decision tree over a stream of network records grows as data arrives: first a single split on "Packets > 10" with a leaf testing "Protocol = http"; later the yes-branch adds a split on "Bytes > 60K" with leaves for "Protocol = http" and "Protocol = ftp".]
From Gehrke's SIGMOD tutorial slides
46. Hoeffding Tree: Strengths and Weaknesses
Strengths
Scales better than traditional methods
Sublinear with sampling
Very small memory utilization
Incremental
Make class predictions in parallel
New examples are added as they come
Weaknesses
Could spend a lot of time with ties
Memory used with tree expansion
Number of candidate attributes
47. VFDT (Very Fast Decision Tree)
Modifications to Hoeffding Tree
Near-ties broken more aggressively
G computed only every n_min examples
Deactivates certain leaves to save memory
Poor attributes dropped
Initialize with traditional learner (helps learning curve)
Compare to Hoeffding Tree: Better time and memory
Compare to traditional decision tree
Similar accuracy
Better runtime with 1.61 million examples
• 21 minutes for VFDT
• 24 hours for C4.5
Still does not handle concept drift
48. CVFDT (Concept-adapting VFDT)
Concept Drift
Time-changing data streams
Incorporate new and eliminate old
CVFDT
Increments counts with new examples
Decrements counts as old examples expire
• Sliding window
• Nodes assigned monotonically increasing IDs
Grows alternate subtrees
When alternate more accurate, then replace old
O(w) better runtime than VFDT-window
49. Ensemble of Classifiers Algorithm
H. Wang, W. Fan, P. S. Yu, and J. Han, “Mining Concept-
Drifting Data Streams using Ensemble Classifiers”, KDD'03.
Method (derived from the ensemble idea in classification)
train K classifiers from K chunks
for each subsequent chunk
train a new classifier
test other classifiers against the chunk
assign weight to each classifier
select top K classifiers
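A toy sketch of the chunk-based loop above; `train` and `evaluate` are user-supplied stand-ins (here a majority-class "classifier" scored by accuracy), illustrative assumptions rather than the paper's actual models:

```python
def update_ensemble(ensemble, train, evaluate, chunk, K):
    """One round of the chunk-based ensemble: train a classifier on
    the newest chunk, score every classifier on that chunk (its
    weight), and keep the K best."""
    candidates = ensemble + [train(chunk)]
    ranked = sorted(candidates, key=lambda c: evaluate(c, chunk), reverse=True)
    return ranked[:K]

# Toy instantiation: a "classifier" is the majority label of its
# training chunk, and its accuracy on a chunk is its weight.
train = lambda chunk: max(set(chunk), key=chunk.count)
evaluate = lambda clf, chunk: chunk.count(clf) / len(chunk)
```

After a concept drift, classifiers trained on old chunks score poorly on new chunks and are dropped, so the ensemble tracks the current concept.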
50. Clustering Evolving Data Streams
What methodologies?
Compute and store summaries of past data
Apply a divide-and-conquer strategy
Incremental clustering of incoming data streams
Perform micro-clustering as well as macro-clustering analysis
Explore multiple time granularity for the analysis of cluster
evolution
Divide stream clustering into on-line and off-line processes
51. Clustering Data Streams
Based on the k-median method
Data stream points from metric space
Find k clusters in the stream s.t. the sum of distances
from data points to their closest center is minimized
Constant factor approximation algorithm
In small space, a simple two step algorithm:
1. For each set of M records, Si,
find O(k) centers in S1, …, Sl
Local clustering: Assign each point in Si to its closest
center
2. Let S’ be centers for S1, …, Sl with each
center weighted by number of points assigned to it
Cluster S’ to find k centers
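The two-step algorithm above can be sketched for 1-D points; farthest-first traversal stands in for the constant-factor k-median subroutine (an assumption made for brevity, not part of the original algorithm):

```python
def farthest_first(points, k):
    """Greedy k-center seeding: repeatedly add the point farthest
    from the current centers (a simple stand-in for a proper
    k-median step)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(abs(p - c) for c in centers)))
    return centers

def small_space_cluster(stream, M, k):
    """Two-step divide-and-conquer sketch: (1) cluster each chunk of
    M points locally, weighting each center by the points assigned to
    it; (2) cluster the centers to obtain the final k centers (weights
    are collected but, for brevity, the final step ignores them; a
    weighted k-median would use them)."""
    weighted = []                                    # (center, weight)
    for start in range(0, len(stream), M):
        chunk = stream[start:start + M]
        centers = farthest_first(chunk, k)
        assigned = {c: 0 for c in centers}
        for p in chunk:                              # local clustering
            assigned[min(centers, key=lambda c: abs(p - c))] += 1
        weighted.extend(assigned.items())
    return farthest_first([c for c, _ in weighted], k)
```

Only O(k) centers per chunk are retained, so memory stays small no matter how long the stream runs.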
53. Hierarchical Tree and Drawbacks
Method
Maintain at most m level-i medians
On seeing m of them, generate O(k) level-(i+1) medians
of weight equal to the sum of the weights of the
intermediate medians assigned to them
Drawbacks
Low quality for evolving data streams
(register only k centers)
Limited functionality in discovering and exploring clusters
over different portions of the stream over time
54. Clustering for Mining Stream Dynamics
Network intrusion detection: one example
Detect bursts of activities or abrupt changes in real time—
by on-line clustering
The methodology
by C. Aggarwal, J. Han, J. Wang, P. S. Yu, VLDB'03
Tilted time framework:
otherwise dynamic changes cannot be found
Micro-clustering: better quality than k-means/k-median
• incremental, online processing and maintenance
Two stages: micro-clustering and macro-clustering
With limited “overhead” to achieve high efficiency,
scalability, quality of results and power of
evolution/change detection
55. CluStream: A Framework for Clustering Evolving Data Streams
Design goal
High quality for clustering evolving data streams with
greater functionality
While keeping the stream mining requirements in mind
• One-pass over the original stream data
• Limited space usage and high efficiency
CluStream: A framework for clustering evolving data streams
Divide the clustering process into online and offline
components
Online component: periodically stores summary statistics
about the stream data
Offline component: answers various user questions based
on the stored summary statistics
56. The CluStream Framework
Micro-cluster
Statistical information about data locality
Temporal extension of the cluster-feature vector
• Multi-dimensional points X1 … Xk …
with time stamps T1 … Tk …
• Each point contains d dimensions, i.e., X = (x1 … xd)
• A micro-cluster for n points is defined as a (2·d + 3) tuple
(CF2^x, CF1^x, CF2^t, CF1^t, n)
where CF1^x, CF2^x hold the per-dimension sums and sums of squares of the points, and CF1^t, CF2^t the sum and sum of squares of their time stamps
Pyramidal time frame
Decide at what moments the snapshots of the statistical
information are stored away on disk
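The temporal cluster-feature tuple can be computed directly; its additivity (a merged micro-cluster's fields are just the sums of the parts) is what makes incremental maintenance cheap. Names are illustrative:

```python
def micro_cluster(points, stamps):
    """Build the (2*d + 3) tuple (CF2x, CF1x, CF2t, CF1t, n) for n
    d-dimensional points with time stamps: per-dimension sums of
    squares and sums, plus the sum of squared and plain stamps."""
    d = len(points[0])
    cf2x = [sum(p[i] ** 2 for p in points) for i in range(d)]
    cf1x = [sum(p[i] for p in points) for i in range(d)]
    cf2t = sum(t * t for t in stamps)
    cf1t = sum(stamps)
    return cf2x, cf1x, cf2t, cf1t, len(points)

def merge(a, b):
    """Additivity: merging two micro-clusters just adds their fields."""
    return ([x + y for x, y in zip(a[0], b[0])],
            [x + y for x, y in zip(a[1], b[1])],
            a[2] + b[2], a[3] + b[3], a[4] + b[4])
```

From the tuple one can recover centroid and spread (e.g., the centroid is CF1^x / n) without revisiting the raw points.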
57. CluStream: Pyramidal Time Frame
Snapshots of a set of micro-clusters are stored following the
pyramidal pattern
They are stored at differing levels of granularity depending
on the recency
Snapshots are classified into different orders
varying from 1 to log(T)
The i-th order snapshots occur at intervals of α^i, where α ≥ 1
Only the last (α + 1) snapshots of each order are stored
58. CluStream: Clustering On-line Streams
Online micro-cluster maintenance
Initial creation of q micro-clusters
• q is usually significantly larger than
the number of natural clusters
Online incremental update of micro-clusters
• If new point is within max-boundary, insert into the micro-cluster
• O.w., create a new cluster
• May delete obsolete micro-cluster or merge two closest ones
Query-based macro-clustering
Based on a user-specified time-horizon h and the number of
macro-clusters K, compute macroclusters using the k-means
algorithm
59. Stream Data Mining: What are the Research Issues?
Mining sequential patterns in data streams
Mining partial periodicity in data streams
Mining notable gradients in data streams
Mining outliers and unusual patterns in data streams
Stream clustering
Multi-dimensional clustering analysis?
Clustering is not confined to a 2-D metric space; how to incorporate
other features, especially non-numerical properties?
Stream clustering with other clustering approaches?
Constraint-based cluster analysis with data streams?
60. Summary: Stream Data Mining
Stream Data Mining is a rich and on-going research field
Current research focus in database community:
DSMS system architecture
Continuous query processing
Supporting mechanisms
Stream data mining and stream OLAP analysis
Powerful tools for finding general and unusual patterns
Effectiveness, efficiency and scalability:
lots of open problems
Philosophy on stream data analysis and mining
A multi-dimensional stream analysis framework
Time is a special dimension: Tilted time frame
What to compute and what to save?—Critical layers
Partial materialization and precomputation
Mining dynamics of stream data
61. References on Stream Data Mining (1)
C. Aggarwal, J. Han, J. Wang, P. S. Yu. A Framework for Clustering Data Streams, VLDB'03
C. C. Aggarwal, J. Han, J. Wang and P. S. Yu. On-Demand Classification of Evolving Data Streams, KDD'04
C. Aggarwal, J. Han, J. Wang, and P. S. Yu. A Framework for Projected Clustering of High Dimensional Data Streams, VLDB'04
S. Babu and J. Widom. Continuous Queries over Data Streams, SIGMOD Record, Sept. 2001
B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and Issues in Data Stream Systems, PODS'02 (conference tutorial)
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams, VLDB'02
P. Domingos and G. Hulten. Mining High-Speed Data Streams, KDD'00
A. Dobra, M. N. Garofalakis, J. Gehrke, R. Rastogi. Processing Complex Aggregate Queries over Data Streams, SIGMOD'02
J. Gehrke, F. Korn, D. Srivastava. On Computing Correlated Aggregates over Continuous Data Streams, SIGMOD'01
C. Giannella, J. Han, J. Pei, X. Yan and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities, in Kargupta et al. (eds.), Next Generation Data Mining, 2004
62. References on Stream Data Mining (2)
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering Data Streams, FOCS'00
G. Hulten, L. Spencer and P. Domingos. Mining Time-Changing Data Streams, KDD'01
S. Madden, M. Shah, J. Hellerstein, V. Raman. Continuously Adaptive Continuous Queries over Streams, SIGMOD'02
G. Manku, R. Motwani. Approximate Frequency Counts over Data Streams, VLDB'02
A. Metwally, D. Agrawal, and A. El Abbadi. Efficient Computation of Frequent and Top-k Elements in Data Streams, ICDT'05
S. Muthukrishnan. Data Streams: Algorithms and Applications, Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, 2003
R. Motwani and P. Raghavan. Randomized Algorithms, Cambridge Univ. Press, 1995
S. Viglas and J. Naughton. Rate-Based Query Optimization for Streaming Information Sources, SIGMOD'02
Y. Zhu and D. Shasha. StatStream: Statistical Monitoring of Thousands of Data Streams in Real Time, VLDB'02
H. Wang, W. Fan, P. S. Yu, and J. Han. Mining Concept-Drifting Data Streams using Ensemble Classifiers, KDD'03