The document proposes a simple approach to the complex problem of computing unique visitors over time ranges, a problem that involves maintaining normalized and denormalized views of the data. The approach involves:
1) Storing all data in a master dataset and continuously recomputing indexes and views as a function of all the data to maintain normalized and denormalized views.
2) Querying both recent real-time views and historical batch views to retrieve the necessary data for a time-range query, combining the two for high performance and accuracy.
3) Approximating unique counts for recent data by ignoring real-time equivalences to keep the real-time layer simple while still providing good query performance and eventual accuracy.
PHP Backends for Real-Time User Interaction using Apache Storm (DECK36)
Engaging users in real time is the topic of our times. Whether it's a game, a shop, or a content network, the aim remains the same: providing a personalized experience. In this workshop we will look under the hood of Apache Storm and lay a firm foundation for how to use it with PHP. With that, you can leverage your existing codebase and PHP expertise for an entirely new world: real-time analytics and business logic working on message streams. During the course of the workshop, we will introduce Apache Storm and take a look at all of its components. We will then skyrocket the applicability of Storm by showing you how to implement its components with PHP. All exercises will be conducted using an example project, the infamous and most exhilarating lolcat kitten game ever conceived: Plan 9 From Outer Kitten. In order to follow the hands-on exercises, you will need a development VM prepared by us with all relevant system components and our project repositories. To make the workshop experience as smooth as possible for all participants, please bring a prepared computer to the workshop, as there will be no time to deal with installation and setup issues. Please download all prerequisites and install them as described: VM, Plan 9 webapp, Plan 9 storm backend (Tutorial: https://github.com/DECK36/plan9_workshop_tutorial ).
Learning Stream Processing with Apache Storm (Eugene Dvorkin)
Over the last couple of years, Apache Storm has become a de-facto standard for developing real-time analytics and complex event processing applications. Storm lets you tackle real-time data processing challenges the same way Hadoop enables batch processing of Big Data. Storm enables companies to have "Fast Data" alongside "Big Data". Some use cases where Storm can be used are Fraud Detection, Operational Intelligence, Machine Learning, ETL, Analytics, etc.
In this meetup, Eugene Dvorkin, Architect @WebMD and NYC Storm User Group organizer, will teach Apache Storm and stream processing fundamentals. While this meeting is geared toward new Storm users, experienced users may find something interesting as well.
The following topics will be covered:
• Why use Apache Storm?
• Common use cases
• Storm Architecture - components, concepts, topology
• Building simple Storm topology with Java and Groovy
• Trident and micro-batch processing
• Fault tolerance and guaranteed message delivery
• Running and monitoring Storm in production
• Kafka
• Storm at WebMD
• Resources
Real-Time Big Data at In-Memory Speed, Using Storm (Nati Shalom)
Storm, a popular framework from Twitter, is used for real-time event processing. The challenge presented is how to manage the state of your real-time data processing at all times. In addition, you need Storm to integrate with your batch processing system (such as Hadoop) in a consistent manner.
This session will demonstrate how to integrate Storm with an in-memory database/grid, and explore various strategies for seamlessly integrating the data grid with Hadoop and Cassandra. By achieving smooth integration with consistent management, you will be able to easily manage all the tiers of your Big Data stack in a consistent and effective way.
See more at: http://nosql2013.dataversity.net/sessionPop.cfm?confid=74&proposalid=5526
Some of the biggest issues at the center of analyzing large amounts of data are query flexibility, latency, and fault tolerance. Modern technologies that build upon the success of “big data” platforms, such as Apache Hadoop, have made it possible to spread the load of data analysis to commodity machines, but these analyses can still take hours to run and do not respond well to rapidly-changing data sets.
A new generation of data processing platforms -- which we call “stream architectures” -- has converted data sources into streams of data that can be processed and analyzed in real time. This has led to the development of various distributed real-time computation frameworks (e.g. Apache Storm) and multi-consumer data integration technologies (e.g. Apache Kafka). Together, they offer a way to do predictable computation on real-time data streams.
In this talk, we will give an overview of these technologies and how they fit into the Python ecosystem. As part of this presentation, we also released streamparse, a new Python library that makes it easy to debug and run large Storm clusters.
Links:
* http://parse.ly/code
* https://github.com/Parsely/streamparse
* https://github.com/getsamsa/samsa
Distributed real time stream processing - why and how (Petr Zapletal)
In this talk you will discover various state-of-the-art open-source distributed streaming frameworks: their similarities and differences, implementation trade-offs, their intended use-cases, and how to choose between them. Petr will focus on the popular frameworks, including Spark Streaming, Storm, Samza and Flink. You will also get a theoretical introduction, common pitfalls, popular architectures, and much more.
The demand for stream processing is increasing. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. Stream-based applications, including trading, social networks, the Internet of Things, and system monitoring, are becoming more and more important. A number of powerful, easy-to-use open source platforms have emerged to address this.
Petr's goal is to provide a comprehensive overview of modern streaming solutions and to help fellow developers with picking the best possible solution for their particular use-case. Join this talk if you are thinking about, implementing, or have already deployed a streaming solution.
These slides are from a brief seminar I gave for a Ph.D. exam, "Perspective in Parallel Computing" (held by Prof. Marco Danelutto), at the University of Pisa (Italy).
They are a rapid introduction to Apache Storm and how it relates to classical algorithmic skeleton parallel frameworks.
Presentation to a combined meetup of Bay Area Lisp and Bay Area Clojure groups. Presented three Clojure projects at BackType:
Cascalog - Batch processing in Clojure
ElephantDB - Database written in Clojure
Storm - Distributed, fault-tolerant, reliable stream processing and RPC
Slides from a talk given at the NYC Cassandra Meetup, discussing how Storm works and how it integrates well with Apache Cassandra.
There is also a segue into an example project that uses Storm and Cassandra to implement a scalable reactive web crawler.
http://github.com/tjake/stormscraper
Real time big data analytics with Storm, by Ron Bodkin of Think Big Analytics (Data Con LA)
This talk provides an overview of the open source Storm system for processing Big Data in real time. The talk starts with an overview of the technology, including key components: Nimbus, Zookeeper, Topology, Tuple, Trident. It looks at integration with Hadoop through YARN and recent improvements. The presentation then dives into the complex Big Data architectures in which Storm can be integrated. The result is a compelling stack of technologies including integrated Hadoop clusters, MPP, and NoSQL databases.
After this, we look at example use cases for Storm: real-time advertising statistics, updating a Machine Learned model for content popularity predictions, and financial compliance monitoring.
Bobby Evans and Tom Graves, the engineering leads for Spark and Storm development at Yahoo, will talk about how these technologies are used on Yahoo's grids and reasons to use one or the other.
Bobby Evans is the low latency data processing architect at Yahoo. He is a PMC member on many Apache projects including Storm, Hadoop, Spark, and Tez. His team is responsible for delivering Storm as a service to all of Yahoo and maintaining Spark on Yarn for Yahoo (Although Tom really does most of that work).
Tom Graves is a Senior Software Engineer on the Platform team at Yahoo. He is an Apache PMC member on Hadoop, Spark, and Tez. His team is responsible for delivering and maintaining Spark on Yarn for Yahoo.
Big Data Day LA 2015 - Large Scale Distinct Count: The HyperLogLog Algorithm (Data Con LA)
"At OpenX we not only use the tools in big data ecosystems to solve our business problems, but also explore the cutting edge algorithms for practical uses. HyperLogLog is one of the algorithm that we use intensively in our internal system. It has really low computation cost and can easily plug into map-reduce framework (hadoop or spark). Some of the applications that worth to highlight are:
* high cardinality test
* distinct count of unique users over time
* Visualize hyperloglog for fraud detection"
Statistical analysis and mining of huge multi-terabyte data sets is a common task nowadays, especially in areas like web analytics and Internet advertising. Analysis of such large data sets often requires powerful distributed data stores like Hadoop and heavy data processing with techniques like MapReduce. This approach often leads to heavyweight high-latency analytical processes and poor applicability to realtime use cases. On the other hand, when one is interested only in simple additive metrics like total page views or average price of conversion, it is obvious that raw data can be efficiently summarized, for example, on a daily basis or using simple in-stream counters. Computation of more advanced metrics like the number of unique visitors or the most frequent items is more challenging and requires a lot of resources if implemented straightforwardly. In this article, I provide an overview of probabilistic data structures that allow one to estimate these and many other metrics and trade estimation precision for memory consumption.
Probabilistic data structures. Part 2. Cardinality (Andrii Gakhov)
The book "Probabilistic Data Structures and Algorithms in Big Data Applications" is now available at Amazon and from local bookstores. More details at https://pdsa.gakhov.com
In the presentation, I described common data structures and algorithms for estimating the number of distinct elements in a set (cardinality), such as Linear Counting, HyperLogLog, and HyperLogLog++. Each approach comes with the math behind it and simple examples to clarify the theory.
Analyzing the Number of Site Visitors [Counting Unique Elements] (Qrator Labs)
Highload++ conference / November 7, 2016 / Speaker: Konstantin Ignatov, software engineer in the research department at Qrator Labs.
To answer precisely how many unique visitors my site had over an arbitrary time interval in the past, I would need to save, at regular intervals, the set of visitors (for simplicity, say their IP addresses) seen during the preceding interval. Clearly, storing that volume of information is unrealistic, and even if it were possible, answering a query would mean unioning a large number of sets and counting the elements of the resulting set, which is very slow. Even switching from exact algorithms to approximate ones does not save the situation: either accuracy cannot be guaranteed, or the memory and compute required are comparable to the exact algorithm.
Probabilistic Data Structures, Edmonton Data Science Meetup, March 2018 (Kyle Davis)
Let's explore how Redis (and Redis Enterprise) can be used to store data not only in deterministic structures but also in probabilistic structures like Bloom filters, HyperLogLog, Count-Min Sketch, and Cuckoo filters. We examine usage and briefly summarize the algorithms that back these structures. We also review the use cases and applications for probabilistic structures.
This is a high-level presentation I delivered at BIWA Summit: some high-level thoughts related to today's NoSQL and Hadoop SQL engines (not deeply technical).
An application designer usually has to choose where to trade flexibility for specificity (and thus usually performance); knowing when and where to do so is an art and requires experience. This talk will share over a decade's worth of experience making these decisions and the learnings from developing Pivotal's successful Real Time Intelligence (RTI) product using the latest versions of Spring projects: Integration, Data, Boot, MVC/REST and XD. A walk through the RTI architecture will provide the base for an explanation of how Spring performs at hundreds (and millions) of events/operations per second and the techniques that you can use right now in your own Spring applications to minimise resource utilisation and gain performance.
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog (Redis Labs)
Think you have big data? What about high availability requirements? At DataDog we process billions of data points every day, including metrics and events, as we help the world monitor their applications and infrastructure. Being the world's monitoring system is a big responsibility, and thanks to Redis we are up to the task. Join us as we discuss how the DataDog team monitors and scales Redis to power our SaaS-based monitoring offering. We will discuss our usage and deployment patterns, as well as dive into monitoring best practices for production Redis workloads.
The Secrets of Building Realtime Big Data Systems (nathanmarz)
The architectural principles behind building systems that scale to vast amounts of data and operate on that data in realtime.
Presented at POSSCON '11.
Presentation I gave at the Bay Area Clojure Meetup Group on May 6th, 2010. Also demoed examples from the introductory tutorial:
http://nathanmarz.com/blog/introducing-cascalog/
8. Notes:
• Not limiting ourselves to current tooling
• Reasonable variations of existing tooling are acceptable
• Interested in what’s fundamentally possible
10. Approach #1
• Use Key->Set database
• Key = [URL, hour bucket]
• Value = Set of UserIDs
11. Approach #1
• Queries:
• Get all sets for all hours in range of query
• Union sets together
• Compute count of merged set
12. Approach #1
• Lots of database lookups for large ranges
• Potentially a lot of items in sets, so lots of work to merge/count
• Database will use a lot of space
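Below is a minimal sketch of Approach #1 in Python, with a plain dict standing in for the Key->Set database; names like record_pageview and uniques_over_range are illustrative, not from the talk.

```python
# Approach #1 sketch: Key->Set database emulated with a dict.
# Key = (url, hour bucket), Value = set of UserIDs.
from collections import defaultdict

db = defaultdict(set)  # stand-in for a Key->Set database

def hour_bucket(timestamp_secs):
    """Truncate a Unix timestamp to its hour bucket."""
    return timestamp_secs // 3600

def record_pageview(url, user_id, timestamp_secs):
    db[(url, hour_bucket(timestamp_secs))].add(user_id)

def uniques_over_range(url, start_hour, end_hour):
    """Union the per-hour sets and count the merged set.
    One lookup per hour: expensive for large ranges."""
    merged = set()
    for hour in range(start_hour, end_hour + 1):
        merged |= db.get((url, hour), set())
    return len(merged)

record_pageview("example.com/page", "user-a", 0)
record_pageview("example.com/page", "user-b", 3600)
record_pageview("example.com/page", "user-a", 7200)
print(uniques_over_range("example.com/page", 0, 2))  # -> 2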
15. Approach #2
• Use Key->HyperLogLog database
• Key = [URL, hour bucket]
• Value = HyperLogLog structure
16. Approach #2
• Queries:
• Get all HyperLogLog structures for all hours in range of query
• Merge structures together
• Retrieve count from merged structure
17. Approach #2
• Much more efficient use of storage
• Less work at query time
• Mild accuracy tradeoff
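A sketch of Approach #2, assuming the third-party datasketch package provides the HyperLogLog structure (any HLL implementation with update/merge/count operations would work the same way):

```python
# Approach #2 sketch: Key->HyperLogLog instead of Key->Set.
# Assumes the third-party `datasketch` package (pip install datasketch).
from collections import defaultdict
from datasketch import HyperLogLog

db = defaultdict(HyperLogLog)  # (url, hour bucket) -> HLL structure

def record_pageview(url, user_id, hour):
    db[(url, hour)].update(user_id.encode("utf8"))

def uniques_over_range(url, start_hour, end_hour):
    """Merge the per-hour HLLs and read off the estimate.
    Each HLL stays small regardless of how many users it saw."""
    merged = HyperLogLog()
    for hour in range(start_hour, end_hour + 1):
        if (url, hour) in db:
            merged.merge(db[(url, hour)])
    return merged.count()  # approximate unique count
```

Per the speaker's notes, roughly 1 KB per structure suffices to estimate cardinalities up to ~1B with only about 2% error, which is why this layout is so much cheaper than storing raw sets.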
19. Approach #3
• Use Key->HyperLogLog database
• Key = [URL, bucket, granularity]
• Value = HyperLogLog structure
20. Approach #3
• Queries:
• Compute minimal number of database lookups to satisfy range
• Get all HyperLogLog structures in range
• Merge structures together
• Retrieve count from merged structure
21. Approach #3
• All benefits of #2
• Minimal number of lookups for any range, so less variation in latency
• Minimal increase in storage
• Requires more work at write time
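A sketch of the bucket-cover computation in Approach #3, assuming hour/day/week granularities whose buckets are aligned to multiples of their size (the exact granularity set is a design choice, not prescribed by the talk):

```python
# Approach #3 sketch: cover an hour range with the fewest aligned buckets.
# Keys become (url, start_hour, granularity_in_hours).
GRANULARITIES = [168, 24, 1]  # week, day, hour (in hours), coarsest first

def bucket_cover(start_hour, end_hour):
    """Greedily cover [start_hour, end_hour) with the coarsest
    aligned buckets, so large ranges need few database lookups."""
    buckets = []
    hour = start_hour
    while hour < end_hour:
        for g in GRANULARITIES:
            # usable if aligned and entirely inside the range
            if hour % g == 0 and hour + g <= end_hour:
                buckets.append((hour, g))
                hour += g
                break
    return buckets

# A 30-day range (720 hours) collapses to a handful of lookups.
print(len(bucket_cover(0, 720)))  # -> 6 (4 week buckets + 2 day buckets)
print(bucket_cover(0, 27))        # -> [(0, 24), (24, 1), (25, 1), (26, 1)]
```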
31. Approach #1
• [URL, hour] -> Set of PersonIDs
• UserID -> Set of buckets
• Indexes to incrementally normalize UserIDs into PersonIDs
32. Approach #1
• Getting complicated
• Large indexes
• Operations require a lot of work
33. Approach #2
• [URL, bucket] -> Set of UserIDs
• Like Approach 1, incrementally normalize UserIDs
• UserID -> PersonID
34. Approach #2
• Query:
• Retrieve all UserID sets for range
• Merge sets together
• Convert UserIDs -> PersonIDs to produce new set
• Get count of new set
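A sketch of that query path, where the UserID -> PersonID index is applied at read time and IDs absent from the index normalize to themselves (all names here are illustrative):

```python
# Approach #2 sketch: convert UserIDs to PersonIDs at query time.
buckets = {}          # (url, hour) -> set of UserIDs
user_to_person = {}   # normalization index, maintained incrementally

def uniques_over_range(url, start_hour, end_hour):
    merged = set()
    for hour in range(start_hour, end_hour + 1):
        merged |= buckets.get((url, hour), set())
    # Normalize after merging: two UserIDs belonging to the same
    # person collapse to one PersonID before counting.
    people = {user_to_person.get(u, u) for u in merged}
    return len(people)
```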
36. Attempt 1:
• Maintain index from UserID -> PersonID
• When receive A <-> B:
• Find what they’re each normalized to, and transitively normalize all reachable IDs to “smallest” val
38. Attempt 2:
• UserID -> PersonID
• PersonID -> Set of UserIDs
• When receive A <-> B:
• Find what they’re each normalized to, and choose one for both to be normalized to
• Update all UserIDs in both normalized sets
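A sketch of Attempt 2's two-index bookkeeping when an equivalence A <-> B arrives; choosing the lexicographically smallest PersonID mirrors the talk's "smallest" val convention, and the rest of the details are assumptions:

```python
# Attempt 2 sketch: maintain UserID -> PersonID and its inverse.
user_to_person = {}   # UserID -> PersonID
person_to_users = {}  # PersonID -> set of UserIDs normalized to it

def receive_equiv(a, b):
    """Record that UserIDs a and b belong to the same person."""
    pa = user_to_person.get(a, a)
    pb = user_to_person.get(b, b)
    if pa == pb:
        return
    keep, absorb = min(pa, pb), max(pa, pb)  # normalize to "smallest" val
    members = person_to_users.setdefault(keep, {keep})
    # Re-point every UserID currently normalized to the absorbed PersonID.
    for u in person_to_users.pop(absorb, set()) | {absorb, a, b}:
        user_to_person[u] = keep
        members.add(u)

receive_equiv("u3", "u1")  # u3 <-> u1: both normalize to "u1"
receive_equiv("u4", "u3")  # u4 <-> u3: u4 joins "u1" transitively
print(user_to_person)      # {'u3': 'u1', 'u1': 'u1', 'u4': 'u1'}
```

As the editor's notes point out, equivalences arriving concurrently (e.g. 4<->3 and 3<->1) would need locking around this update so they don't step on each other.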
41. General challenges with traditional architectures
• Redundant storage of information (“denormalization”)
• Brittle to human error
• Operational challenges of enormous installations of very complex databases
48. Normalization vs Denormalization
Normalized schema
Users table:
ID | Name | Location ID
1 | Sally | 3
2 | George | 1
3 | Bob | 3
Locations table:
Location ID | City | State | Population
1 | New York | NY | 8.2M
2 | San Diego | CA | 1.3M
3 | Chicago | IL | 2.7M
50. Denormalized schema
Users table (city and state copied in):
ID | Name | Location ID | City | State
1 | Sally | 3 | Chicago | IL
2 | George | 1 | New York | NY
3 | Bob | 3 | Chicago | IL
Locations table:
Location ID | City | State | Population
1 | New York | NY | 8.2M
2 | San Diego | CA | 1.3M
3 | Chicago | IL | 2.7M
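A toy illustration of the trade-off, with dicts standing in for the tables above: the normalized layout needs a join-style second lookup per query, while the denormalized layout answers in one lookup but stores city/state redundantly.

```python
# Normalized: a user's city/state requires a join-style second lookup.
locations = {1: ("New York", "NY", "8.2M"),
             2: ("San Diego", "CA", "1.3M"),
             3: ("Chicago", "IL", "2.7M")}
users = {1: ("Sally", 3), 2: ("George", 1), 3: ("Bob", 3)}

def city_state_normalized(user_id):
    name, loc_id = users[user_id]
    city, state, _ = locations[loc_id]  # the "join"
    return city, state

# Denormalized: city/state copied into the users table for fast reads,
# but a location change must now be applied everywhere it is stored,
# and any missed update leaves the copies inconsistent.
users_denorm = {1: ("Sally", 3, "Chicago", "IL"),
                2: ("George", 1, "New York", "NY"),
                3: ("Bob", 3, "Chicago", "IL")}

def city_state_denormalized(user_id):
    _, _, city, state = users_denorm[user_id]  # no join needed
    return city, state
```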
67. Conclusions
• Easy to understand and implement
• Scalable
• Concurrency / fault-tolerance easily abstracted away from you
• Great query performance
73. Approach #1
• Use the exact same approach as we did in fully incremental implementation
• Query performance only degraded for recent buckets
• e.g., “last month” range computes vast majority of query from efficient batch indexes
74. Approach #1
• Relatively small number of buckets in realtime layer
• So not that much effect on storage costs
75. Approach #1
• Complexity of realtime layer is softened by existence of batch layer
• Batch layer continuously overrides realtime layer, so mistakes are auto-fixed
76. Approach #1
• Still going to be a lot of work to implement this realtime layer
• Recent buckets with lots of uniques will still cause bad query performance
• No way to apply recent equivs to batch views without restructuring batch views
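A sketch of how a time-range query spans the two layers, reusing the assumed datasketch HyperLogLog from earlier; batch_horizon (an illustrative parameter) marks the first hour the batch views have not yet absorbed:

```python
# Lambda-style query sketch: most of the range is served by batch
# views; only the recent, not-yet-batch-processed hours hit the
# realtime layer, whose results the next batch run will override.
from datasketch import HyperLogLog

batch_view = {}     # (url, hour) -> HLL, recomputed from the master dataset
realtime_view = {}  # (url, hour) -> HLL, incremented as events arrive

def uniques_over_range(url, start_hour, end_hour, batch_horizon):
    """batch_horizon = first hour not yet covered by the batch views."""
    merged = HyperLogLog()
    for hour in range(start_hour, end_hour + 1):
        view = batch_view if hour < batch_horizon else realtime_view
        if (url, hour) in view:
            merged.merge(view[(url, hour)])
    return merged.count()
```

For a "last month" range this means the vast majority of buckets come from the efficient batch indexes, and only a handful come from the realtime layer.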
89. Incremental compaction
• Databases write to write-ahead log before modifying disk and memory indexes
• Need to occasionally compact the log and indexes
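A minimal write-ahead-log sketch of the pattern described here; real databases are vastly more involved, but the shape is the same: append durably first, update the in-memory index, and occasionally compact the log.

```python
# Write-ahead log sketch: durability first, then the in-memory index;
# compaction rewrites the log so it doesn't grow without bound.
import json, os

LOG = "data.log"
index = {}  # in-memory key -> value index

def put(key, value):
    with open(LOG, "a") as f:
        f.write(json.dumps([key, value]) + "\n")  # durable before visible
        f.flush()
        os.fsync(f.fileno())
    index[key] = value

def compact():
    """Rewrite the log with one entry per live key. In real stores this
    is the phase notorious for sudden performance swings."""
    with open(LOG + ".tmp", "w") as f:
        for key, value in index.items():
            f.write(json.dumps([key, value]) + "\n")
    os.replace(LOG + ".tmp", LOG)
```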
93. Incremental compaction
• Notorious for causing huge, sudden changes in performance
• Machines can seem locked up
• Necessitated by random writes
• Extremely complex to deal with
94. More Complexity
• Dealing with CAP / eventual consistency
• “Call Me Maybe” blog posts found data loss problems in many popular databases
• Redis
• Cassandra
• ElasticSearch
101. Lambda Architecture
• This is the most basic form of it
• Many variants of it incorporating more and/or different kinds of layers
Editor's Notes
clear up confusion around it. lambda architecture addresses a lot of nasty, fundamental complexities that aren't talked about enough
most of the talk won't even talk about LA, we'll work on an example problem and you'll see LA naturally emerge
this isn’t even capable of solving the problem we’re going to look at
i want this talk to be interactive...
going deep into technical details
please do not hesitate to jump in with any questions
uniques for just hour 1 = 3
uniques for hours 1 and 2 = 3
uniques for 1 to 3 = 5
uniques for 2-4 = 4
synchronous
asynchronous
characterized by maintaining state incrementally as data comes in and serving queries off of that same state
1 KB to estimate size up to 1B with only 2% error
it’s not a stretch to imagine a database that can do hyperloglog natively, so updates don’t require fetching the entire set
it’s not a stretch to imagine a database that can do hyperloglog natively, so updates don’t require fetching the entire set
example: in 1 month there are ~720 hours, 30 days, 4 weeks, 1 month... adding all granularities makes 755 stored values total instead of 720 values, only a 4.8% increase in storage
except now userids should be normalized, so if there’s equiv that user only appears once even if under multiple ids
equiv can change ANY or ALL buckets in the past
will get back to incrementally updating userids
will get back to incrementally updating userids
offload a lot of the work to read time
this is still a lot of work at read time
overall
if using distributed database to store indexes and computing everything concurrently
when receive equivs for 4<->3 and 3<->1 at same time, will need some sort of locking so they don't step on each other
e.g. granularities, the 2 indexes for user id normalization... we know it’s a bad idea to store the same thing in multiple places... opens up possibility of them getting out of sync if you don’t handle every case perfectly
If you have a bug that accidentally sets the second value of all equivs to 1, you’re in trouble
even the version without equivs suffers from these problems
2 functions: produce water of a certain strength, and produce water of a certain temperature
faucet on left gives you “hot” and “cold” inputs which each affect BOTH outputs - complex to use
faucet on right gives you independent “heat” and “strength” inputs, so SIMPLE to use
neither is very complicated
so just a quick overview of denormalization, here’s a schema that stores user information and location information
each is in its own table, and a user’s location is a reference to a row in the location table
this is pretty standard relational database stuff
now let’s say a really common query is getting the city and state a person lives in
to do this you have to join the tables together as part of your query
you might find joins are too expensive, they use too many resources
so you denormalize the schema for performance
you redundantly store the city and state in the users table to make that query faster, cause now it doesn’t require a join
now obviously, this sucks. the same data is now stored in multiple places, which we all know is a bad idea
whenever you need to change something about a location you need to change it everywhere it’s stored
but since people make mistakes, inevitably things become inconsistent
but you have no choice, you want to normalize, but you have to denormalize for performance
i hope you are looking at this and asking the question...
still have to compute uniques over time and deal with the equivs problem
how are we better off than before?
options for taking different approaches to problem without having to sacrifice too much
people say it does “key/value”, so I can use it when I need key/value operations... and they stop there
can’t treat it as a black box, that doesn’t tell the full story
some of his tests saw over 30% data loss during partitions
major operational simplification to not require random writes
i’m not saying you can’t make a database that does incremental compaction and deals with the other complexities of random writes well, but it’s clearly a fundamental complexity, and i feel it’s better to not have to deal with it at all
remember, we’re talking about what’s POSSIBLE, not what currently exists
my experience with elephantdb
Does not avoid any of the complexities of massive distributed r/w databases
Does not avoid any of the complexities of massive distributed r/w databases or dealing with eventual consistency
everything i’ve talked about completely generalizes, applies to both AP and CP architectures