HybridStore is an efficient data management system for hybrid flash-based sensor devices. It partitions the data stream into segments, builds an index for each segment, and organizes the segments with an inter-segment index. Queries can therefore skip unnecessary segments, and each per-segment index stays small. HybridStore keeps every NAND page fully occupied and writes pages sequentially, avoids in-place updates, processes queries efficiently over large datasets, and has memory requirements low enough for sensor devices. It was implemented on TinyOS and shown to outperform alternatives in trace-driven simulations involving millions of records.
Presentation hybrid store-ewsn-2013
1. HybridStore: An Efficient Data Management System for Hybrid Flash-based Sensor Devices
Baobing Wang and John S. Baras
Department of Electrical and Computer Engineering
Institute for Systems Research
University of Maryland, College Park, USA
briankw@umd.edu
10th European Conference on Wireless Sensor Networks (EWSN)
February 14, 2013
Brian (UMD@USA) HybridStore February 14, 2013 1 / 15
3. Motivation
In-situ Data Storage on Sensor Motes
Centralized data collection: wastes energy (e.g., TinyDB)
LoCal project [1]: 455 nodes, > 900M readings/year
Only aggregated data are required: average noise level, peak power consumption, usage pattern
Sensors store data locally: sensor database
Flash memory: high capacity, energy efficient
Figure: Per-byte cost: storage, computation and communication [Mathur’06]
[1] http://local.cs.berkeley.edu/
4. Motivation
Design Challenges
Unlike magnetic disks, no in-place updates on flash memories
NOR flash: byte-oriented, random-accessible, low capacity
NAND flash: page-oriented, high capacity, more energy-efficient
Random writes are 100× more expensive than sequential writes
Very limited RAM: 4KB to 10KB
5. Related Work
Flash-based Storage Systems
Only time-window queries: TL-Tree [Li’12], FlashLog [Nath’09]
Large RAM footprint: FlashDB [Nath’07], LA-Tree [Agrawal’09]
Antelope [Tsiftes’11]: NOR flash only, discrete values
MicroHash [Lin’06]: long chain of partial pages, extensive page reads and writes, complex failure recovery
No efficient support for joint queries; reliance on a global index
6. Contributions
HybridStore Interface
insert(float key, void* record, uint8_t length)
select(uint32_t t1, uint32_t t2, float k1, float k2)
HybridStore Features
All NAND pages are fully occupied and written purely sequentially
In-place updates and out-of-place writes are completely avoided
Process typical joint queries efficiently, even on large-scale datasets
Data aging without overhead
Sensor-friendly: 16.5KB ROM and 3.2KB RAM in TinyOS 2.1
Potential Applications
Storage layer abstraction: Squirrel [Mottola’10]
9. HybridStore: Overview
Partition the data stream into segments
Create an in-segment index for each segment
Create an inter-segment index to organize segments
Benefits: skip unnecessary segments, small index per segment
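The segment idea above can be illustrated with a small in-memory sketch. Nothing here comes from the HybridStore code base: `SegmentedStore`, `SEGMENT_CAPACITY`, the explicit `timestamp` argument and the rule of skipping a segment whose time or key range does not overlap the query are all assumptions made for illustration, and a linear scan stands in for the real indexes.

```python
from dataclasses import dataclass, field

SEGMENT_CAPACITY = 4  # hypothetical records per segment, kept tiny for the demo


@dataclass
class Segment:
    # Each record is (timestamp, key, payload); timestamps arrive in order.
    records: list = field(default_factory=list)

    @property
    def t_range(self):
        return self.records[0][0], self.records[-1][0]

    @property
    def k_range(self):
        keys = [key for _, key, _ in self.records]
        return min(keys), max(keys)


class SegmentedStore:
    def __init__(self):
        self.segments = [Segment()]

    def insert(self, timestamp, key, payload):
        # Start a new segment once the current one is full.
        if len(self.segments[-1].records) == SEGMENT_CAPACITY:
            self.segments.append(Segment())
        self.segments[-1].records.append((timestamp, key, payload))

    def select(self, t1, t2, k1, k2):
        """Return (matching records, number of segments actually scanned)."""
        hits, scanned = [], 0
        for seg in self.segments:
            if not seg.records:
                continue
            t_lo, t_hi = seg.t_range
            k_lo, k_hi = seg.k_range
            # Assumed skip rule: no overlap in time or in key range.
            if t_hi < t1 or t_lo > t2 or k_hi < k1 or k_lo > k2:
                continue
            scanned += 1
            hits += [r for r in seg.records
                     if t1 <= r[0] <= t2 and k1 <= r[1] <= k2]
        return hits, scanned
```

With twelve records spread over three segments, a query for timestamps 5 to 6 touches only the middle segment; the other two are rejected from their range summaries alone.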
10. HybridStore: Index Management
Inter-segment skip list (addr, tmin): locate segments within [t1, t2]
In-segment β-Tree: locate records within [k1, k2]
In-segment Bloom filter: check the existence of key values if k1 = k2
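A rough sketch of the inter-segment lookup, under stated assumptions: a sorted Python list with binary search stands in for the skip list (it is not the skip list itself), each entry is an `(addr, tmin)` pair as on the slide, and a segment is assumed to cover the time span from its own `tmin` up to the next segment's `tmin`. The function name `locate_segments` is hypothetical.

```python
from bisect import bisect_right


def locate_segments(index, t1, t2):
    """Return addresses of segments that may hold records in [t1, t2].

    index: list of (addr, tmin) pairs sorted by tmin in ascending order.
    The newest segment is treated as open-ended because only tmin is stored.
    """
    tmins = [tmin for _, tmin in index]
    last = bisect_right(tmins, t2) - 1      # newest segment starting at or before t2
    if last < 0:
        return []                           # query ends before the first segment
    first = max(bisect_right(tmins, t1) - 1, 0)  # segment whose span contains t1
    return [addr for addr, _ in index[first:last + 1]]
```

For example, with segments starting at times 0, 100, 200 and 300, a query for [150, 250] returns the second and third segments only.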
13. HybridStore: In-segment Index
In-segment Bloom filter: check the existence of key values if k1 = k2
v bits, q hash functions, represent n items: $p = \left(1 - \left(1 - \frac{1}{v}\right)^{qn}\right)^{q}$
Must be maintained in RAM: NOR flash is byte-oriented
If q = 3, n = 4096, p ≈ 3.06%, then v = 32768 (i.e., 4KB)
Horizontal partition: fixed small Bloom filter sections (e.g., 256B)
Vertical partition: group fragments with the same offset in the same NAND page
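The sizing numbers on this slide can be checked with a short sketch. The false-positive formula is the standard one for a Bloom filter with v bits, q hash functions and n items; the `BloomFilter` class is a generic illustration that uses SHA-256 to derive bit positions, which is an assumption and not the hash scheme of HybridStore, and it does not model the horizontal or vertical partitioning.

```python
import hashlib


def false_positive_rate(v, q, n):
    """p = (1 - (1 - 1/v)^(q*n))^q for v bits, q hash functions, n items."""
    return (1.0 - (1.0 - 1.0 / v) ** (q * n)) ** q


class BloomFilter:
    def __init__(self, v, q):
        self.v, self.q = v, q
        self.bits = bytearray((v + 7) // 8)  # v bits packed into bytes

    def _positions(self, key):
        # Derive q bit positions by hashing the key with q different prefixes.
        for i in range(self.q):
            digest = hashlib.sha256(f"{i}:{key!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.v

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Calling `false_positive_rate(32768, 3, 4096)` gives roughly 0.0306, matching the 3.06% quoted above, and a 32768-bit filter occupies 4096 bytes.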
16. HybridStore: Storage Hierarchy
NOR flash: circular array, fixed segment size
NAND flash: circular array, logical segment (multiple erase blocks)
Index structure: updated in a NOR segment, copied to the NAND segment later
Header: [T1, T2], [K1, K2], dataAddr, idxAddr, bfAddr, skipList
Figure: (a) Storage Hierarchy (b) NAND Segment Structure
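As a loose illustration of the circular-array layout, the sketch below keeps a fixed number of segment slots and reclaims the oldest slot when the array is full. The class name `CircularSegmentArray` and its methods are hypothetical, segments are represented by plain Python objects, and no flash erase or write behaviour is modelled.

```python
class CircularSegmentArray:
    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.head = 0    # slot index of the oldest segment
        self.count = 0   # number of live segments

    def append(self, segment):
        """Store a new segment; return the evicted oldest one, if any."""
        evicted = None
        if self.count == len(self.slots):
            # Array is full: drop the oldest segment by advancing the head.
            evicted = self.slots[self.head]
            self.head = (self.head + 1) % len(self.slots)
            self.count -= 1
        tail = (self.head + self.count) % len(self.slots)
        self.slots[tail] = segment
        self.count += 1
        return evicted

    def oldest_to_newest(self):
        return [self.slots[(self.head + i) % len(self.slots)]
                for i in range(self.count)]
```

Reclaiming space only moves the head position; the remaining segments stay where they are, which is the property a circular layout is chosen for.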
17. HybridStore: Operations
Insertion
Update the β-Tree: allocate new bucket if necessary
Update the Bloom filter buffer: flush it out to NOR flash if necessary
NOR segment is full: copy to the NAND segment, update the skip list, start a new segment
Querying: t1, t2, k1, k2
Indexes used, by query type:
t1 = t2, k1 = k2: skip list
t1 = t2, k1 < k2: skip list
t1 < t2, k1 = k2: skip list + Bloom filter + β-Tree
t1 < t2, k1 < k2: skip list + β-Tree
Skip a segment if [K1, K2] and [k1, k2] do not overlap
Data Aging: delete the oldest NAND segment
No need to update any pointer
No need to move any data page
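The query table on this slide can be read as a small dispatch rule. The sketch below encodes that reading only; the function name `indexes_used` is hypothetical and the function returns index names rather than performing any lookup.

```python
def indexes_used(t1, t2, k1, k2):
    """Return the index structures consulted for a select(t1, t2, k1, k2)."""
    if t1 == t2:
        # Single point in time: the skip list alone is listed in the table.
        return ["skip list"]
    if k1 == k2:
        # Equality on the key over a time range: the Bloom filter is also used.
        return ["skip list", "Bloom filter", "β-Tree"]
    # Key range over a time range.
    return ["skip list", "β-Tree"]
```

The Bloom filter appears only in the equality case because it answers membership questions for a single key value, not for a range.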
20. HybridStore: Implementation and Evaluation
TinyOS implementation: 16.5KB ROM, 3.2KB RAM
Trace-driven simulation: over 2.6 million weather records in 5 years
Insertion: 13% ∼ 18% improvement
Figure: Performance per insertion, β-Tree vs. static tree, for NOR flash segment sizes of 64, 128 and 256 KB: (a) Latency (ms), (b) Energy (µJ), (c) Space Overhead (%)
21. HybridStore: Value-based Equality Query
Key detection: 26.18ms and 1.5mJ over 0.5 million readings
Nonexistent keys: more than 3× improvement
Figure: Impact of Bloom filter for nonexistent keys: (a) Latency (ms) and (b) Energy (mJ) over time ranges from 1 day to 1 year, for β-Tree (64KB, 128KB, 256KB, and 64KB w/o BF) and Static (128KB)
22. HybridStore: Full Query
Retrieve 120K readings in 11.08 seconds from 0.5 million records
[SenSys ’11]: over 20 seconds to get 50% from 50,000 records
Figure: HybridStore performance per query of full queries: (a) Total latency per query (s) and (b) Total energy per query (mJ) over time ranges from 1 day to 1 year, with series for 1, 3, 5, 7 and 9 degrees
23. Conclusion and Future Work
Conclusion
HybridStore: efficient, light-weight, and sensor-friendly
Process typical joint queries efficiently
Process large-scale datasets efficiently
Future Work [2]
Failure recovery mechanism
Distributed database system based on HybridStore
Testbed experiments
[2] B. Wang and J. S. Baras. HybridDB: An Efficient Database System Supporting Incremental epsilon-Approximate Querying for Storage-Centric Sensor Networks. Submitted to the ACM Transactions on Sensor Networks, 2013, pp. 1–35