This document discusses distributed algorithms for big data. It begins with an overview of HyperLogLog for estimating cardinality and counting distinct elements in a large data set. It then explains how HyperLogLog works by using a hash function to distribute the data across buckets and applying the LogLog algorithm to each bucket before taking the harmonic mean. The document also covers Paxos for distributed consensus, explaining the phases of prepare, promise, accept and learn to reach agreement in the presence of failures.
Similar to Distributed algorithms for big data @ GeeCon (20)
2. Who Am I ?
Duy Hai DOAN
Cassandra technical advocate
• talks, meetups, confs
• open-source devs (Achilles, …)
• OSS Cassandra point of contact
☞ duy_hai.doan@datastax.com
☞ @doanduyhai
14. LogLog algorithm (simplified)
1) Choose a very distributive hash function H
2) For each incoming element in the data set (article_id, login, uuid…), apply H
3) Convert the hash into a binary sequence
4) Estimate the cardinality by observing the binary sequences
0111010010101…
0010010010001…
1010111001100…
…
15. LogLog intuition
Uniform probability:
50% of the bit sequences start with 0xxxxx
50% of the bit sequences start with 1xxxxx
1/4 of the bit sequences start with 00xxxxx
1/4 of the bit sequences start with 01xxxxx
1/4 of the bit sequences start with 10xxxxx
1/4 of the bit sequences start with 11xxxxx
16. LogLog intuition
Look for the position r of the 1st bit set to 1 starting from the left
000000001xxxx → r = 9
0001xxxxxxxxx → r = 4
000001xxxxxxx → r = 6
000000…0001xxxxxxx → rank r
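A minimal sketch of the rank lookup described on this slide, assuming the hash is already available as a string of '0' and '1' characters. The function name `rank` and the convention for an all-zero sequence are illustrative assumptions, not taken from the talk.

```python
def rank(bits: str) -> int:
    """Return the 1-based position of the first bit set to 1, from the left.

    Assumption: an all-zero sequence gets rank len(bits) + 1.
    """
    index = bits.find("1")
    return index + 1 if index >= 0 else len(bits) + 1


# The three examples from the slide, with the unknown 'x' bits set to 0.
print(rank("0000000010000"))
print(rank("0001000000000"))
print(rank("0000010000000"))
```

The three calls correspond to the slide's examples with r = 9, r = 4 and r = 6.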
17. LogLog intuition
There are 2^r combinations of r-length bit sequences:
000…0001, 000…0010, 000…0011, …, 111…1111
18. LogLog intuition
Uniform probability:
1/2^r of the bit sequences start with 000000…0001xxx
1/2^r of the bit sequences start with 000000…0010xxx
…
1/2^r of the bit sequences start with 111111…1111xxx
20. Reversing the logic
I have as much chance of observing 000000…0001xxx
as of observing 000000…0010xxx
as of observing 000000…0011xxx
etc…
21. Reversing the logic
If I have observed 000000…0001xxx
I should probably also observe 000000…0010xxx
and probably observe 000000…0011xxx
etc…
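The reasoning above can be turned into a very rough single-bucket estimator: if the largest rank seen is r, guess that about 2^r distinct elements went by. The sketch below is only an illustration under assumptions: SHA-1 stands in for the hash function H, and the names `hash_bits`, `rank` and `single_estimate` are hypothetical.

```python
import hashlib
from typing import Iterable


def hash_bits(element: str, width: int = 32) -> str:
    """Hash an element and return the top `width` bits as a '0'/'1' string.

    Assumption: SHA-1 is used as the "very distributive" hash function H.
    """
    digest = hashlib.sha1(element.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") >> (len(digest) * 8 - width)
    return format(value, f"0{width}b")


def rank(bits: str) -> int:
    """1-based position of the first bit set to 1, from the left."""
    index = bits.find("1")
    return index + 1 if index >= 0 else len(bits) + 1


def single_estimate(elements: Iterable[str]) -> int:
    """Estimate the number of distinct elements as 2 ** (max rank observed)."""
    max_rank = 0
    for element in elements:
        max_rank = max(max_rank, rank(hash_bits(element)))
    return 2 ** max_rank


# Duplicates hash to the same bits, so they never change the estimate.
print(single_estimate(["alice", "bob", "alice", "carol"]))
```

The result is always a power of two and can be far off for a single run, which is the weakness the bucket and harmonic-mean ideas address.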
27. HyperLogLog idea
1) Eliminate and smooth out outlying elements
☞ harmonic mean
$$H = \frac{n}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_n}}$$
Credits: Wikipedia
28. HyperLogLog idea
Example: harmonic mean of 3, 6, 7, 12 and 120
Arithmetic mean = 29.6
$$H = \frac{5}{\frac{1}{3} + \frac{1}{6} + \frac{1}{7} + \frac{1}{12} + \frac{1}{120}} \approx 6.81$$
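To check the example numerically, here is a small sketch using the five values that appear in the slide's formula (3, 6, 7, 12, 120). The helper names are illustrative only.

```python
from typing import Sequence


def arithmetic_mean(values: Sequence[float]) -> float:
    """Plain average: sum divided by count."""
    return sum(values) / len(values)


def harmonic_mean(values: Sequence[float]) -> float:
    """Count divided by the sum of reciprocals."""
    return len(values) / sum(1.0 / v for v in values)


values = [3, 6, 7, 12, 120]
print(round(arithmetic_mean(values), 2))  # 29.6
print(round(harmonic_mean(values), 2))    # 6.81
```

The outlier 120 pulls the arithmetic mean far above the typical values, while the harmonic mean stays close to them, which is why it is used to combine per-bucket estimates.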
29. HyperLogLog idea
2) Distribute the computation (« divide and conquer »)
☞ apply LogLog to n buckets
p = prefix length (here 6)
buckets count = 2^p (here 64)
101101000xxxxxxx: the first p bits (101101) select the bucket
30. HyperLogLog idea
2) Distribute the computation (« divide and conquer »)
Input data stream, split by 6-bit prefix into 64 buckets:
B1: 000000xxxx
B2: 000001xxxx
B3: 000010xxxx
B4: 000011xxxx
…
B61: 111100xxxx
B62: 111101xxxx
B63: 111110xxxx
B64: 111111xxxx
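The bucket split can be sketched as follows, assuming the hash is a '0'/'1' string and p = 6 as on the slides. The function name `split_hash` is hypothetical, and the bucket index it returns is 0-based, whereas the slide labels the buckets B1 to B64.

```python
from typing import Tuple


def split_hash(bits: str, p: int = 6) -> Tuple[int, str]:
    """Split a bit string into (0-based bucket index, remaining bits).

    The first p bits pick one of 2**p buckets; the rest is what the
    per-bucket LogLog rank is computed on.
    """
    return int(bits[:p], 2), bits[p:]


# Prefix 101101 from the earlier slide, with the 'x' bits filled arbitrarily.
bucket, rest = split_hash("1011010001110000")
print(bucket, rest)  # 45 0001110000
```

With the slide's 1-based labels, index 45 corresponds to bucket B46.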
32. HyperLogLog formula
For each bucket i, we compute the cardinality estimate for this bucket, M_i
M_i ≈ 2^max(r_i)
max(r_i) = max rank found in bucket i
33. HyperLogLog formula
Harmonic mean H(M_i) computed on all M_i, by definition
H(M_i) ≈ n/b
n = global cardinality estimate (what we look for)
b = number of buckets
☞ n ≈ b • H(M_i)
34. HyperLogLog, the maths
$$H(x_i) = \frac{b}{\frac{1}{x_1} + \frac{1}{x_2} + \dots + \frac{1}{x_b}} = b \left( \frac{1}{\sum_{i=1}^{b} \frac{1}{x_i}} \right)$$
$$H(x_i) = b \left( \sum_{i=1}^{b} \frac{1}{x_i} \right)^{-1} = b \left( \sum_{i=1}^{b} x_i^{-1} \right)^{-1}$$
35. HyperLogLog, the maths
We replace the x_i in the previous formula by M_i
Then we replace the M_i in the formula by 2^max(r_i)
$$H(M_i) = b \left( \sum_{i=1}^{b} M_i^{-1} \right)^{-1}$$
$$H(M_i) = b \left( \sum_{i=1}^{b} 2^{-\max(r_i)} \right)^{-1}$$
36. HyperLogLog, the maths
Inject H(Mi) into the formula for cardinality estimate: n ≈ b・H(Mi)
$$n \approx \alpha_b \, b^2 \left( \sum_{i=1}^{b} 2^{-\max(r_i)} \right)^{-1}$$
n = cardinality estimate
b = buckets count
αb = corrective constant
max(ri) = max rank observed in each bucket
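The formula can be tried out with a minimal sketch. Several things here are assumptions rather than part of the slides: SHA-1 truncated to 64 bits as the hash, the rank taken as the 1-based position of the leftmost 1 bit after the prefix, and αb = 0.709, the constant given for 64 buckets in the original HyperLogLog paper. The small-range and large-range corrections of real implementations are left out, and the function names are hypothetical.

```python
import hashlib

P = 6                 # prefix length
B = 1 << P            # buckets count, 2^p = 64
HASH_BITS = 64        # assumption: 64-bit hash values
ALPHA = 0.709         # corrective constant for 64 buckets

def hash64(item: str) -> int:
    """Hash an item to a 64-bit integer (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def rank(bits: int, width: int) -> int:
    """1-based position of the leftmost 1 bit in a width-bit word."""
    return width - bits.bit_length() + 1

def estimate_cardinality(items) -> float:
    """Raw HyperLogLog estimate: alpha_b * b^2 / sum(2^-max(r_i))."""
    max_rank = [0] * B
    rest_bits = HASH_BITS - P
    for item in items:
        h = hash64(item)
        bucket = h >> rest_bits               # first p bits pick the bucket
        rest = h & ((1 << rest_bits) - 1)     # remaining bits give the rank
        max_rank[bucket] = max(max_rank[bucket], rank(rest, rest_bits))
    return ALPHA * B * B / sum(2.0 ** -r for r in max_rank)

items = [f"user-{i}" for i in range(50_000)]
print(round(estimate_cardinality(items)))
```

With only 64 buckets the expected relative error is roughly 1.04/√64 ≈ 13%, so the printed value should be near 50 000 rather than equal to it. Feeding the same items twice gives exactly the same estimate, since only the maximum rank per bucket is kept.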
39. Which use-cases ?
Number of unique visitors on a high-traffic web site
Number of unique clicks on popular articles/items
TopN elements (visitors, items …)
…
40. Some real-world implementations
Apache Cassandra: distributed table size estimate
Redis: out-of-the-box data structure
DataFu (Apache Pig): standard UDF
Twitter Algebird: algorithms lib for Storm & Scalding
48. Paxos phase 1: prepare
n = sequence number
[Diagram: the client asks the proposer for consensus on value val; the proposer sends prepare(n) to each of the five acceptors]
52. Paxos phase 2.5: learn
learner = durable storage
[Diagram: the proposer sends "store val" to each of the three learners]
53. Paxos phase 1: prepare
The proposer:
• picks a monotonically increasing sequence number n (e.g. a timeuuid)
• sends prepare(n) to all acceptors
[Diagram: Proposer → Acceptor: prepare(n)]
54. Paxos phase 1: promise
Each acceptor, upon receiving a prepare(n):
• if it has not sent any accepted(m,?) or promise(m,valm) with m > n
☞ return promise(n,∅), store n locally
☞ promise not to accept any prepare(o) or accept(o,?) with o < n
[Diagram: Acceptor → Proposer: promise(n,∅); the acceptor stores n,∅]
55. Paxos phase 1: promise
Each acceptor, upon receiving a prepare(n):
• if it has already sent an accepted(m,valm) with m < n
☞ return promise(m,valm)
[Diagram: Acceptor → Proposer: promise(m,valm); the acceptor holds m,valm]
56. Paxos phase 1: promise
Each acceptor, upon receiving a prepare(n):
• if it has already sent an accepted(m,?) or a promise(m,?) with m > n
☞ ignore OR return Nack (optimization)
[Diagram: Acceptor → Proposer: Nack]
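The three promise cases above can be sketched as a single handler on the acceptor side. This is an illustration under assumptions, not a reference implementation: sequence numbers are plain integers instead of timeuuids, messages are tuples, the state is kept in memory rather than on durable storage, and the class and field names are hypothetical.

```python
from typing import Optional, Tuple

class Acceptor:
    def __init__(self) -> None:
        # Highest sequence number promised or accepted so far (assumption:
        # one field tracks both, since accepting n also implies promising n).
        self.promised_n: Optional[int] = None
        # Last accepted (m, val) pair, if any.
        self.accepted: Optional[Tuple[int, str]] = None

    def on_prepare(self, n: int) -> tuple:
        # A higher number was already promised or accepted: Nack.
        if self.promised_n is not None and self.promised_n > n:
            return ("nack",)
        self.promised_n = n  # store n locally
        # A value was already accepted with an older number: report it.
        if self.accepted is not None:
            m, val = self.accepted
            return ("promise", m, val)
        # Nothing accepted yet: promise(n, ∅).
        return ("promise", n, None)

acceptor = Acceptor()
print(acceptor.on_prepare(3))   # fresh acceptor, empty promise
print(acceptor.on_prepare(2))   # older number, rejected
```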
57. Paxos phase 1 objectives
• discover any pending proposal so that it can make progress
• block old proposal(s) that are stalled
Proposer asks for a plebiscite (prepare)
Acceptors grant allegiance (promise)
Proposer: "Who's the boss?"
Acceptor: "You, sir!"
58. Paxos phase 2: accept
The proposer receives a quorum of promise(mi,valmi)
• if all promises are promise(n,∅) then send accept(n,valn)
• otherwise, take the valmi of the biggest mi and send accept(n,valmax(mi))
[Diagram: Proposer → Acceptor: accept(n,valmax(mi)) OR accept(n,valn)]
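The value selection rule can be sketched as a small function. It assumes each promise is a pair (m, val), with val set to None for an empty promise ∅, and that the caller has already collected a quorum; the name choose_value is hypothetical.

```python
from typing import Optional, Sequence, Tuple

Promise = Tuple[int, Optional[str]]  # (sequence number, previously accepted value or None)

def choose_value(promises: Sequence[Promise], own_value: str) -> str:
    """Pick the value the proposer must send in accept(n, ...)."""
    # Keep only promises that carry a previously accepted value.
    with_value = [(m, val) for m, val in promises if val is not None]
    if not with_value:
        # All promises are empty: the proposer is free to push its own value.
        return own_value
    # Otherwise it must re-propose the value attached to the biggest m.
    return max(with_value, key=lambda promise: promise[0])[1]

print(choose_value([(5, None), (5, None), (5, None)], "x"))
print(choose_value([(5, None), (3, "a"), (4, "b")], "x"))
```

The second call returns the pending value rather than the proposer's own, which is how phase 1 lets a stalled proposal make progress.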
59. Paxos phase 2: accepted
Each acceptor, upon receiving an accept(n,val):
• if it has not made any promise(m,?) with m > n
☞ return accepted(n,val), store val locally
• else, ignore the request
[Diagram: Acceptor → Proposer: accepted(n,val); the acceptor stores n,val]
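The accept rule can be sketched in the same spirit. The state layout is an assumption: promised_n holds the highest number promised so far (-1 when nothing has been promised), sequence numbers are integers, and returning None stands for "ignore the request". All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class AcceptorState:
    promised_n: int = -1                          # highest promised number
    accepted: Optional[Tuple[int, str]] = None    # last accepted (n, val)

def on_accept(state: AcceptorState, n: int, val: str) -> Optional[tuple]:
    # A promise was made for a higher number: ignore the request.
    if state.promised_n > n:
        return None
    # Otherwise store the value locally and acknowledge.
    state.promised_n = n
    state.accepted = (n, val)
    return ("accepted", n, val)

state = AcceptorState(promised_n=5)
print(on_accept(state, 4, "a"))   # stale proposal, ignored
print(on_accept(state, 5, "a"))   # matches the promise, accepted
```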
60. Paxos phase 2.5: learn
The proposer receives a quorum of accepted(n,val)
• send val to the learners (durable storage)
The consensus is found and its value is val
This defines a round of Paxos
[Diagram: Proposer → Learner: store val]
61. Paxos phase 2 objectives
• commit any pending proposal
• learn the consensus value
Proposer issues a proposal (accept)
Acceptors accept the proposal (accepted)
Proposer: "Accept this!"
Acceptor: "Yes, sir!"
62. Formal Paxos limits
• once a consensus val is reached, we can’t change it!
• needs to reset val for another Paxos round
Multi-Paxos
• many rounds of Paxos in parallel, impacting different partitions
• each server can be Proposer, Acceptor & Learner
Fast-Paxos, Egalitarian-Paxos, etc …
63. Conflict cases
Failure of a minority of acceptors
[Diagram: five acceptors a1 to a5 all receive prepare(n1) and reply promise(∅); two acceptors then fail (☠); the three remaining acceptors receive accept(n1,a) and reply accepted(a), so the round still succeeds (✔︎)]
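Why the round still succeeds comes down to quorum arithmetic, sketched below. It assumes a quorum means a strict majority of acceptors; the function names are hypothetical.

```python
def quorum_size(acceptors: int) -> int:
    """Smallest strict majority of the acceptors."""
    return acceptors // 2 + 1

def can_reach_consensus(acceptors: int, failed: int) -> bool:
    """True if the surviving acceptors still form a quorum."""
    return acceptors - failed >= quorum_size(acceptors)

print(quorum_size(5))              # majority needed out of 5 acceptors
print(can_reach_consensus(5, 2))   # a minority has failed
print(can_reach_consensus(5, 3))   # a majority has failed
```

With five acceptors the proposer needs three accepted(a) replies, so losing two acceptors is tolerated while losing three blocks the round.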
68. Which use-cases ?
Reliable master election for master/slave architectures
Distributed consensus
Distributed Compare & Swap algorithm
Distributed lock
69. Some real-world implementations
Apache Cassandra: lightweight transactions
Google Chubby/Spanner: lock/distributed transactions
Heroku: via Doozerd for distributed configuration data
Neo4j (≥ 1.9): replaces Apache ZooKeeper for high availability