This document summarizes Kazuho Oku's presentation on running a real-time stats service on MySQL. Some key points:
1) Oku described Pathtraq, a web ranking service in Japan that collects about 1 million access records per day from 10,000 users.
2) To provide real-time analysis of this data, compressed tables are stored in RAM to avoid slow random access on HDD. Custom compression algorithms were developed to compress URLs and access stats.
3) Additional optimizations included creating a message queue, limiting pre-computation loads, and developing an in-memory cache system with locking to minimize database queries.
Running a Realtime Stats Service on MySQL, by Kazuho Oku
Slides used at the Percona Performance Conference. They describe the optimizations and tweaks used in running pathtraq.com, one of Japan's largest web stats services.
2. Background
Apr. 23 2009 Running Realtime Stats Service on MySQL
3. Who am I?
Name: Kazuho Oku
Original Developer of Palmscape / Xiino
The oldest web browser for Palm OS
Working at Cybozu Labs since 2005
Research subsidiary of Cybozu, Inc.
Cybozu is a leading groupware vendor in Japan
My weblog: tinyurl.com/kazuho
5. What is Pathtraq?
Started in Aug. 2007
Web ranking service
One of Japan’s largest
10,000 users submit access information
1,000,000 access records per day
like Alexa, but semi-realtime, and per-page
6. What is Pathtraq? (cont'd)
Automated Social News Service
find what's hot
like Google News + Digg
calculate relevance from access stats
Search by...
no filtering (all the Internet)
by category
by keyword
by URL (per-domain, etc.)
9. How to Provide Real-time Analysis?
Data Set (as of Apr. 23 2009)
# of URLs: 147,748,546
# of total accesses: 413,272,527
Sharding is not a good option
since we need to join the tables and aggregate
prefix-search by URL, search by keyword, then join with the access data table
Core tables should be stored in RAM
not on HDD, due to lots of random access
10. Our Decision was to...
Keep URL and access stats in RAM
compression for size and speed
Create a new message queue
Limit Pre-computation Load
Create our own cache, with locks
to minimize database access
Fulltext-search database on SSD
11. Our Servers
Main Server
Opteron 2218 x2, 64GB Mem
MySQL, Apache
Fulltext Search Server
Opteron 240EE, 2GB Mem, Intel SSD
MySQL (w. Tritonn/Senna)
Helper Servers
for Content Analysis
for Screenshot Generation
12. The Long Tail of the Internet
y = C x^{-0.44}
# of URLs with 1/10 the hits: x2.75 (10^0.44 ≈ 2.75)
13. Compressing URLs
14. Compressing URLs
The Challenges:
URLs are too short for gzip, etc.
URLs should be prefix-searchable in compressed form
How to run LIKE 'http://www.mysql.com/%' on a compressed URL?
The Answer:
Static PPM + Range Coder
15. Static PPM
PPM: Prediction by Partial Matching
What is the next character after ".co"?
The answer is "m"!
PPM is used by 7-zip, etc.
Static PPM is PPM with a static probabilistic model
Many URLs (or English words) have common patterns
Suitable for short texts (like URLs)
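As a rough illustration of the static-model idea, the sketch below builds a fixed next-character table from a small sample of URLs and looks up the most likely character after a given context. The function names, the context length of 3, and the sample URLs are assumptions made for this example; they are not taken from the Pathtraq implementation.

```python
from collections import Counter, defaultdict

def build_table(urls, order=3):
    """Count which character follows each `order`-character context."""
    table = defaultdict(Counter)
    for url in urls:
        for i in range(order, len(url)):
            table[url[i - order:i]][url[i]] += 1
    return table

def predict(table, context):
    """Return (char, probability) for the most likely next character."""
    counts = table.get(context, Counter())
    total = sum(counts.values())
    if total == 0:
        return None  # context never seen in the training URLs
    char, n = counts.most_common(1)[0]
    return char, n / total

# Hypothetical training sample; a real table would be built from stored URLs.
SAMPLE_URLS = [
    "http://www.mysql.com/",
    "http://example.com/a",
    "http://example.co.jp/",
    "http://www.percona.com/",
]
TABLE = build_table(SAMPLE_URLS)
print(predict(TABLE, ".co"))  # ('m', 0.75)
```

Because the table is built once and then frozen, the compressor and decompressor can share it without transmitting any per-string model, which is what makes the approach workable for strings as short as URLs.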
16. Range Coder
A fast variant of arithmetic coding
similar to Huffman encoding, but better
If the probability of the next character being "m" is 75%, it is encoded in about 0.42 bits (-log2 0.75)
Compressed strings preserve the sort order of the uncompressed form
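The two properties above can be checked with a toy arithmetic coder. This is a minimal sketch under stated assumptions, not the range coder from the slides: it uses exact fractions instead of fixed-width integers, a uniform model instead of a PPM table, and a hypothetical end-of-string marker that sorts before every other symbol. All names are illustrative.

```python
import math
from fractions import Fraction

END = "\0"  # assumed end-of-string marker; sorts before every real character

def build_model(alphabet):
    """Give each symbol an equal slice of [0, 1), assigned in sorted order."""
    symbols = [END] + sorted(set(alphabet) - {END})
    step = Fraction(1, len(symbols))
    return {s: (i * step, (i + 1) * step) for i, s in enumerate(symbols)}

def encode(text, model):
    """Narrow [0, 1) once per symbol and return the interval's lower bound."""
    low, width = Fraction(0), Fraction(1)
    for ch in text + END:
        lo, hi = model[ch]
        low += width * lo
        width *= hi - lo
    return low

def bits_for(probability):
    """Ideal code length, in bits, of a symbol with the given probability."""
    return -math.log2(probability)

URLS = ["http://b.com/", "http://a.com/x", "http://a.com/", "http://a.org/"]
MODEL = build_model("".join(URLS))
BY_STRING = sorted(URLS)
BY_CODE = sorted(URLS, key=lambda u: encode(u, MODEL))
print(BY_STRING == BY_CODE, round(bits_for(0.75), 2))  # True 0.42
```

Order is preserved because symbol slices are laid out in the same order as the characters sort, so a string that compares lower always lands in a lower sub-interval.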
17. Create Compression Functions
Build prediction table from stored URLs
Implement range coder
took an open-source impl. and optimized it
the original impl. appended some unnecessary bits at the tail
use SSE instructions for faster operation
coderepos.org/share/browser/lang/cplusplus/range_coder
Link the coder and the table to create MySQL UDFs
18. Rewriting the Server Logic
Change schema
before: url varchar(255) not null # with unique index
after: urlc varbinary(767) not null # with unique index
Change prefix-search form
before: url like 'http://example.com/%'
after: url_compress('http://example.com/')<=urlc and urlc<url_compress('http://example.com0')
Note: "0" is the character that follows "/" in ASCII
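The range form of the prefix search can be sketched on plain strings. The helper names below are hypothetical, and the sketch works on uncompressed keys; the same bounds carry over to compressed keys only because the compression preserves sort order.

```python
import bisect

def prefix_range(prefix):
    """Return (low, high) such that s.startswith(prefix) iff low <= s < high."""
    if not prefix:
        raise ValueError("prefix must not be empty")
    last = prefix[-1]
    # Replace the last character with its successor, e.g. "/" becomes "0".
    return prefix, prefix[:-1] + chr(ord(last) + 1)

def prefix_search(sorted_keys, prefix):
    """Find all keys with the given prefix in a sorted list, via two bisects."""
    low, high = prefix_range(prefix)
    start = bisect.bisect_left(sorted_keys, low)
    end = bisect.bisect_left(sorted_keys, high)
    return sorted_keys[start:end]

KEYS = sorted([
    "http://example.com/",
    "http://example.com/a",
    "http://example.com0",
    "http://example.org/",
])
print(prefix_range("http://example.com/")[1])
print(prefix_search(KEYS, "http://example.com/"))
```

Two bisects on a sorted list play the role that the two comparisons against the unique index on urlc play in the SQL form.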
19. Compression Ratio
Compression ratio: 37%
Size of prediction table: 4MB
Benchmark of the compression functions
compression: 40MB/sec. (570k URLs/sec.)
decompression: 19.3MB/sec. (280k URLs/sec.)
fast enough, since URLs are searchable in compressed form
Prefix-search became faster
shorter indexes lead to faster operation
20. Re: InnoDB Compression
URL compression can coexist with InnoDB compression
though we aren't using InnoDB compression in our production environment
Compression          Table Size
N/A                  100%
URL compression      57%
InnoDB compression   50%
using both           33%
21. Compressing the Stats Table
Used to have two int columns: at, cnt
it was a waste of space, since...
most cnt values are very small numbers
most accesses to each URL occur within a short period (e.g. the day the blog entry was written)
at field should be part of the indexes
at (hours since epoch)   cnt (# of hits)
330168                   1
330169                   2
330173                   1
330197                   1
22. Compressing the Stats Table (cont'd)
Merge the rows into a sparse array
example on the prev. page becomes:
(offset=330197),1,0(repeated 23 times),1,0(repeated 3 times),2,1
Then compress the array
the example becomes a blob of 8 bytes
originally was 8 bytes x 4 rows with index
And store the array in a single column
fewer rows lead to smaller table, faster access
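The merge step can be sketched as follows, using the four rows from the previous slide. The slides do not say how the array is compressed, so the run-length encoding here is only an assumed stand-in, and the function names are hypothetical.

```python
def to_sparse_array(rows):
    """Merge (at, cnt) rows into (offset, counts); counts[i] is hour offset - i."""
    by_hour = dict(rows)
    offset, oldest = max(by_hour), min(by_hour)
    counts = [by_hour.get(hour, 0) for hour in range(offset, oldest - 1, -1)]
    return offset, counts

def run_length_encode(counts):
    """Collapse runs of equal values into [value, run_length] pairs."""
    runs = []
    for value in counts:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

ROWS = [(330168, 1), (330169, 2), (330173, 1), (330197, 1)]
OFFSET, COUNTS = to_sparse_array(ROWS)
print(OFFSET, len(COUNTS))
print(run_length_encode(COUNTS))
```

The array is stored newest-first, so the offset is the most recent hour and long gaps between bursts of traffic collapse into a single run of zeros.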
23. Compressing the Stats Table (cont'd)
Write MySQL UDFs to access the sparse array
cnt_add(column,at,cnt)
-- adds cnt on given index (at)
cnt_between(column,from,to)
-- returns # of hits between given hours
and more...
We use int[N] arrays for vectorized calc.
especially when creating access charts
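What the two accessors do can be sketched on an uncompressed (offset, counts) pair. This only borrows the UDF names from the slide: the real functions operate on the compressed column inside MySQL, and treating the hour range as inclusive on both ends is an assumption.

```python
def cnt_add(column, at, cnt):
    """Return a new (offset, counts) with cnt added at hour `at`."""
    offset, counts = column
    counts = list(counts)
    if at > offset:
        # Newer hour: pad the front with zeros and move the offset forward.
        counts = [0] * (at - offset) + counts
        offset = at
    index = offset - at
    if index >= len(counts):
        # Older hour than anything stored: pad the tail with zeros.
        counts.extend([0] * (index - len(counts) + 1))
    counts[index] += cnt
    return offset, counts

def cnt_between(column, start, end):
    """Sum the hits for hours in the inclusive range [start, end]."""
    offset, counts = column
    total = 0
    for index, value in enumerate(counts):
        if start <= offset - index <= end:
            total += value
    return total

column = (330197, [1])
for at, cnt in [(330173, 1), (330169, 2), (330168, 1), (330198, 5)]:
    column = cnt_add(column, at, cnt)
print(column[0], cnt_between(column, 330168, 330173), cnt_between(column, 330197, 330198))
# 330198 4 6
```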
24. Create a new Message Queue
25. Q4M
A simple, reliable, fast message queue
runs as a pluggable storage engine of MySQL
GPL License; q4m.31tools.com
presented yesterday at MySQL Conference :-p
slides at tinyurl.com/q4m2009
Used for relaying messages between our servers
27. Limit # of CPU-intensive Pre-computations
Use cron & setlock
setlock is part of daemontools by djb
setlock
serializes processes by using flock
-n option: use trylock; if locked, do nothing
# use only one CPU core for pre-computation
*/2 * * * * setlock -n /tmp/tasks.lock precompute_hot_entries
5 0 * * * setlock /tmp/tasks.lock precompute_yesterday_data
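The try-lock behaviour that `setlock -n` relies on can be sketched with `fcntl.flock`. This is an illustration of the locking pattern only, not a reimplementation of setlock; the helper name and the temporary lock path are assumptions, and the code is Unix-only.

```python
import fcntl
import os
import tempfile

def try_lock(path):
    """Take an exclusive flock on `path` without blocking.

    Returns the open file object on success (keep it open to hold the lock),
    or None if another holder already has the lock.
    """
    handle = open(path, "a")
    try:
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle

LOCK_PATH = os.path.join(tempfile.mkdtemp(), "tasks.lock")
first = try_lock(LOCK_PATH)    # acquires the lock
second = try_lock(LOCK_PATH)   # lock is held, so this gives up immediately
print(first is not None, second is None)
first.close()                  # closing the file releases the lock
third = try_lock(LOCK_PATH)    # now the lock can be taken again
print(third is not None)
third.close()
```

Sharing one lock file between both cron entries is what keeps the two pre-computation jobs from running at the same time: the frequent job skips a run if the lock is taken, while the daily job waits for it.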
28. Limit # of Disk-intensive Pre-computations
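Divide pre-computation into blocks and sleep in proportion to the elapsed time
use List::Util qw(max);
use Time::HiRes qw(time sleep);   # sub-second timing and sleeping

my $LOAD = 0.25;   # target fraction of wall-clock time spent working
while (1) {
    my $start = time();
    precompute_block();
    # sleep so that the work just done is $LOAD of the whole cycle
    sleep(max(time() - $start, 0) * (1 - $LOAD) / $LOAD);
}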
29. Creating our own Cache System
30. The Problem
Query cache is flushed on table update
access stats can be (should be) cached for a certain period
Memcached has a thundering-herd problem
all clients try to read the database when a cached entry expires
critical for us since our queries do joins, aggregations, and sort operations
31. Swifty and KeyedMutex
Swifty is an mmap-based cache
cached data shared between processes
lock-free on read, flock on write
notifies a single client that the accessed entry is going to expire within a few seconds
notified client can start updating a cache entry before it expires
KeyedMutex
a daemon used to block multiple clients from issuing the same SQL query concurrently
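The per-key blocking idea can be sketched in a single process with threads. This is not how Swifty or KeyedMutex are built (they work across processes, via mmap and a daemon), and it leaves out the early-expiry notification; the class and function names are hypothetical.

```python
import threading
import time

class KeyedCache:
    """In-process cache where only one caller per key runs the loader."""

    def __init__(self, ttl):
        self.ttl = ttl
        self.entries = {}      # key -> (value, expires_at)
        self.key_locks = {}    # key -> threading.Lock
        self.guard = threading.Lock()

    def _lock_for(self, key):
        with self.guard:
            return self.key_locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[1] > time.monotonic():
            return entry
        return None

    def get(self, key, loader):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        with self._lock_for(key):
            # Another caller may have filled the entry while we waited.
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = loader()
            self.entries[key] = (value, time.monotonic() + self.ttl)
            return value

calls = []

def slow_query():
    calls.append(1)
    time.sleep(0.05)           # stand-in for an expensive SQL query
    return "ranking"

cache = KeyedCache(ttl=60)
results = []
threads = [
    threading.Thread(target=lambda: results.append(cache.get("hot", slow_query)))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(calls), results.count("ranking"))  # 1 8
```

Eight callers ask for the same missing entry, but only the first one runs the query; the rest wait on the key's lock and then read the value it stored.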
32. Swifty and KeyedMutex (cont'd)
Source code is available:
coderepos.org/share/browser/lang/c/swifty
coderepos.org/share/browser/lang/perl/Cache-Swifty
coderepos.org/share/browser/lang/perl/KeyedMutex
34. Senna / Tritonn
Senna is an FTS engine popular in Japan
might not work well with European languages
Tritonn is a replacement for MyISAM FTS
uses Senna as backend
faster than MyISAM FTS
Wrote patches to support SSD
during our transition from RAM to SSD
patches accepted in Senna 1.1.4 / Tritonn 1.0.12
35. FTS: RAM-based vs. SSD-based
Size of FTS data: 20GB
Downgraded hardware to see if SSD-based FTS is feasible
Speed became ¼
but latency of searches is well below one second
          Old Hardware               New Hardware
CPU       Opteron 2218 (2.6GHz) x2   Opteron 240 (1.4GHz)
Memory    32GB                       2GB
Storage   7200rpm SATA HDD           SSD (Intel X25-M)
36. Summary
37. Summary
Use UDFs for optimization
Sometimes it is easier to scale UP
esp. when you can estimate your data growth
Use SSD for FTS
Baidu (China's leading search engine) uses SSD
Most of the things introduced are OSS
We plan to open-source our URL compression table as well
38. We are Looking for...
If you are interested in localizing Pathtraq to your country, please contact us
we do not have resources outside of Japan
to translate the web interface
to ask people to install our browser extension
to follow local regulations, etc.
39. Thank you for listening
tinyurl.com/kazuho