The document presents an append-only approach to enabling real-time analytics with HBase. It describes how replacing read-modify-write (Get+Put) operations with simple append-only writes increases throughput and makes update processing more efficient. An open-source implementation, HBaseHUT, is presented that uses update processors to aggregate data on reads and periodic jobs to merge updates in batches. The approach aims to provide real-time visibility of data changes while handling high volumes of input.
HBaseCon 2012 | Real-time Analytics with HBase - Sematext (Cloudera, Inc.)
In this talk we’ll explain how we implemented “update-less updates” (not a typo!) for HBase using an append-only approach. This approach uses HBase’s core strengths, like fast range scans and the recently added coprocessors, to enable real-time analytics. It shines in situations where high data volume and velocity make random updates (aka Get+Put) prohibitively expensive. Apart from making real-time analytics possible, we’ll show how the append-only approach to updates makes it possible to roll back data changes and to avoid the data inconsistency problems caused by tasks in MapReduce jobs that fail after only partially updating data in HBase.
Real-time Analytics with HBase (short version)
1. Real-time Analytics with HBase (short version)
Alex Baranau, Sematext International
2. About me
Software Engineer at Sematext International
http://blog.sematext.com/author/abaranau
@abaranau
http://github.com/sematext (abaranau)
3. Plan
Problem background: what? why?
Going real-time with the append-only updates approach: how?
Open-source implementation: how exactly?
Q&A
4. Background: our services
Systems Monitoring Service (Solr, HBase, ...)
Search Analytics Service
[Diagram: multiple data collectors feed a Data Analytics & Storage layer, which serves chart data (e.g. yearly bars for 2007-2010) to reports]
5. Background: Report Example
Search engine (Solr) request latency
6. Background: requirements
High volume of input data
Multiple filters/dimensions
Interactive (fast) reports
Show wide range of data intervals
Real-time data changes visibility
No sampling, accurate data needed
7. Background: serve raw data?
simply storing all data points doesn’t work
to show 1 year’s worth of data points collected every second, 31,536,000 points (365 × 24 × 3,600) have to be fetched
pre-aggregation (at least partial) needed
[Diagram: input data goes through data processing (pre-aggregating) into Data Analytics & Storage, which serves aggregated data to reports]
8. Background: pre-aggregation
OLAP-like solution
aggregation rules:
* filters/dimensions
* time range granularities
* ...
[Diagram: each input data item passes through processing logic and produces multiple aggregated values]
9. Background: RMW updates are slow
more dimensions/filters -> greater output-data-to-input-data ratio
individual read-modify-write (Get+Put) operations are slow and inefficient (10-20+ times slower than Puts alone)
[Diagram: each incoming sensor value (e.g. sensor2: 15.0, 41.0) triggers a Get+Put against the storage (HBase), which holds per-sensor aggregates such as avg: 28.7, min: 15.0, max: 41.0; reports read them via Get/Scan]
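To make the cost difference concrete, here is a minimal sketch against the 0.92-era HBase Java client (the table schema, the column family "d", and the row-key suffix scheme are assumptions for illustration, not Sematext's actual layout): the read-modify-write path pays a random Get per incoming value, while the append-only path issues one blind Put under a unique row key and defers merging.

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class UpdateStyles {
      private static final byte[] CF = Bytes.toBytes("d"); // hypothetical column family

      // Read-modify-write: one random Get plus one Put per incoming value.
      static void getPlusPut(HTable table, String sensor, double val) throws IOException {
        Result current = table.get(new Get(Bytes.toBytes(sensor)));
        byte[] raw = current.getValue(CF, Bytes.toBytes("max"));
        double max = (raw == null) ? val : Math.max(Bytes.toDouble(raw), val);
        Put put = new Put(Bytes.toBytes(sensor));
        put.add(CF, Bytes.toBytes("max"), Bytes.toBytes(max));
        table.put(put);
      }

      // Append-only: one blind Put; the unique suffix keeps records from colliding
      // (in HBaseHUT, HutPut.adjustRow plays this role). Merging is deferred.
      static void appendOnly(HTable table, String sensor, double val, long seq) throws IOException {
        Put put = new Put(Bytes.toBytes(sensor + "-" + seq));
        put.add(CF, Bytes.toBytes("value"), Bytes.toBytes(val));
        table.put(put);
      }
    }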
10. Background: batch updates
More efficient data processing: multiple updates processed at once, not individually
Decreases aggregation output (per input record)
Reliable, no data loss in case of failures
Not real-time
If done frequently (closer to real-time), still a lot of costly Get+Put update operations
Handling of failures of tasks which partially wrote data to HBase is complex
11. Going Real-time with Append-based Updates
12. Append-only: main goals
Increase record update throughput
Process updates more efficiently: reduce the number of operations and resource usage
Ideally, apply a high volume of incoming data changes in real-time
Add the ability to roll back changes
Handle high update peaks well
13. Append-only: how?
1. Replace read-modify-write (Get+Put) operations at write time with simple append-only writes (Put)
2. Defer processing of updates to periodic jobs
3. Perform processing of updates on the fly only if the user asks for data earlier than updates are processed
14. Append-only: writing updates
1 Replace update (Get+Put) operations at write time with simple append-only writes (Put)
[Diagram: incoming sensor values (sensor2: 15.0, 41.0) are written as new records with plain Puts; the storage (HBase) accumulates the raw values alongside the previously computed aggregates (e.g. sensor2 avg: 22.7, max: 31.0)]
15. Append-only: writing updates
2 Defer processing of updates to periodic jobs
processing updates with MR job
[Diagram: the MR job merges the appended raw values (15.0, 41.0) into the stored aggregates, e.g. sensor2 goes from avg: 22.7, max: 31.0 to avg: 23.4, max: 41.0]
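A method sketch of what such a periodic merge boils down to, reusing the hypothetical schema and imports from the earlier sketch (HBaseHUT's actual MR job additionally deletes the merged raw records and honors compaction grouping):

    import org.apache.hadoop.hbase.client.Scan;

    // Merge all raw appended records of one sensor into a single compacted row.
    static void mergeSensor(HTable table, String sensor) throws IOException {
      byte[] CF = Bytes.toBytes("d"); // hypothetical column family
      // '~' sorts after '-' and the digit suffixes, bounding the scan to this sensor.
      Scan scan = new Scan(Bytes.toBytes(sensor + "-"), Bytes.toBytes(sensor + "-~"));
      double max = Double.NEGATIVE_INFINITY;
      double sum = 0;
      long count = 0;
      for (Result r : table.getScanner(scan)) {
        double val = Bytes.toDouble(r.getValue(CF, Bytes.toBytes("value")));
        max = Math.max(max, val);
        sum += val;
        count++;
      }
      if (count == 0) {
        return; // nothing appended since the last merge
      }
      Put merged = new Put(Bytes.toBytes(sensor));
      merged.add(CF, Bytes.toBytes("max"), Bytes.toBytes(max));
      merged.add(CF, Bytes.toBytes("avg"), Bytes.toBytes(sum / count));
      table.put(merged);
    }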
16. Append-only: writing updates
3 Perform aggregations on the fly if the user asks for data earlier than updates are processed
[Diagram: a report read scans both the stored aggregates (sensor2 avg: 22.7, max: 31.0) and the not-yet-processed raw values (15.0, 41.0), returning the up-to-date result (avg: 23.4, max: 41.0) on the fly]
17. Append-only: benefits
High update throughput
Real-time updates visibility
Efficient updates processing
Handling high peaks of update operations
Ability to roll back any range of changes
Automatically handling failures of tasks which only partially updated data (e.g. in MR jobs)
Update operation becomes idempotent & atomic, easy to scale writers horizontally
18. Append-only: efficient updates (3/7)
To apply N changes: N Get+Put operations are replaced with N Puts and 1 Scan (shared) + 1 Put operation
Applying N changes at once is much more efficient than performing N individual changes, especially when the updated value is complex (like bitmaps) and takes time to load into memory
Skip compacting if there are too few records to process
Avoids a lot of redundant Get operations when a large portion of operations is inserting new data
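For illustration: applying N = 1,000,000 changes the Get+Put way costs 2,000,000 operations, half of them random reads; the append-only way costs 1,000,000 cheap blind Puts plus one sequential Scan shared by all records and one Put per compacted row, and sequential scan reads are far cheaper per record than random Gets.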
19. Append-only: rollback (5/7)
Rollbacks are easy when updates have not been processed yet (not merged)
To preserve the rollback ability after they are processed (and the result is written back), updates can be compacted into groups
[Diagram: updates written at 9:00, 10:00, and 11:00 are compacted into separate groups by write time, so any time range can still be rolled back after processing]
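For example, if the measurements written between 10:00 and 11:00 turn out to be corrupt, only the records (or the compaction group) from that interval need to be deleted; groups compacted from earlier intervals remain untouched.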
20. Append-only: idempotency (6/7)
Using the append-only approach helps recover from failed tasks which write data to HBase:
without rolling back partial updates
avoids applying duplicate updates
fixes a task failure with a simple restart of the task
Note: the new task should write records with the same row keys as the failed one
easy, especially given that the input data is likely to be the same
Very convenient when writing from MapReduce
Periodic update-processing jobs are also idempotent
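For instance, if a map task derives each row key deterministically from its input (say, from the input split and the record offset) and fails halfway through, the restarted attempt writes the very same keys: the rows written by the failed attempt are simply overwritten with identical data, so no cleanup or deduplication pass is needed.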
21. Append-only: cons
Processing on the fly makes reading slower
Looking for data to compact (during periodic compactions) may be inefficient
Increased amount of stored data, depending on the use-case (in 0.92+)
23. HBaseHUT: Overview
Simple
Easy to integrate into existing projects
Packed as a single jar to be added to the HBase client classpath (also add it to the RegionServer classpath to benefit from server-side optimizations)
Supports native HBase API: HBaseHUT classes implement native HBase interfaces
Apache License, v2.0
24. HBaseHUT: Overview
Processing of updates on-the-fly (behind the ResultScanner interface)
Allows storing back the processed Result
Can use CPs (coprocessors) to process updates on the server side
Periodic processing of updates with a Scan or MapReduce job
Including processing updates in groups based on write ts
Rolling back changes with a MapReduce job
25. HBaseHUT: API overview
Writing data:

    Put put = new Put(HutPut.adjustRow(rowKey));
    // ...
    hTable.put(put);

Reading data:

    Scan scan = new Scan(startKey, stopKey);
    ResultScanner resultScanner =
        new HutResultScanner(hTable.getScanner(scan), updateProcessor);
    for (Result current : resultScanner) { ... }
26. HBaseHUT: API overview
Example UpdateProcessor:

    public class MaxFunction extends UpdateProcessor {
      // ... constructor & utility methods

      @Override
      public void process(Iterable<Result> records,
                          UpdateProcessingResult result) {
        Double maxVal = null;
        for (Result record : records) {
          double val = getValue(record);
          if (maxVal == null || maxVal < val) {
            maxVal = val;
          }
        }
        result.add(colfam, qual, Bytes.toBytes(maxVal));
      }
    }
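A sketch wiring the two previous slides together; the MaxFunction constructor arguments are an assumption, since the slide elides the constructor:

    // Hypothetical usage: compute the max on read using the processor above.
    Scan scan = new Scan(startKey, stopKey);
    ResultScanner resultScanner =
        new HutResultScanner(hTable.getScanner(scan),
                             new MaxFunction(colfam, qual)); // constructor signature assumed
    for (Result current : resultScanner) {
      System.out.println(Bytes.toDouble(current.getValue(colfam, qual)));
    }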
27. HBaseHUT: Next Steps
Wider CPs (HBase 0.92+) utilization
Process updates during memstore flush
Make use of Append operation (HBase 0.94+)
Integrate with asynchbase lib
Reduce storage overhead from adjusting row keys
etc.
Contributors are welcome!
28. Qs?
http://github.com/sematext/HBaseHUT
http://blog.sematext.com
@abaranau
http://github.com/sematext (abaranau)
http://sematext.com, we are hiring! ;)
there will be a longer version of the presentation on the web