HBase is a distributed, column-oriented database that runs on top of HDFS. It provides random real-time read/write access to big data. HBase tables are sharded into regions distributed across region servers. The master server manages the cluster and schema changes. Clients communicate with the master and region servers to read and write data. Keys are sorted and data is stored in columns within column families.
In this lecture we analyze document oriented databases. In particular we consider why there are the first approach to nosql and what are the main features. Then, we analyze as example MongoDB. We consider the data model, CRUD operations, write concerns, scaling (replication and sharding).
Finally we presents other document oriented database and when to use or not document oriented databases.
In this lecture we analyze document oriented databases. In particular we consider why there are the first approach to nosql and what are the main features. Then, we analyze as example MongoDB. We consider the data model, CRUD operations, write concerns, scaling (replication and sharding).
Finally we presents other document oriented database and when to use or not document oriented databases.
Building a near real time search engine & analytics for logs using solrlucenerevolution
Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd
Consolidation and Indexing of logs to search them in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Since the log events mostly have a small size of around 200 bytes to few KBs, makes it more difficult to handle because lesser the size of a log event, more the number of documents to index. In this session, we will discuss the challenges faced by us and solutions developed to overcome them. The list of items that will be covered in the talk are as follows.
Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds
Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards
How choosing a layer based partition strategy helped us to bring down the search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.
This presentation will discuss implementing external authentication when using Percona Server for MongoDB and MongoDB Enterprise. It will review authentication using OpenLDAP or ActiveDirectory and ActiveDirectory with Kerberos.
The presentation will also include examples of the configurations required by these external directory services. It will also review the LDAP Authorization features introduced in MongoDB Enterprise 3.4.
[db tech showcase Tokyo 2017] A11: SQLite - The most used yet least appreciat...Insight Technology, Inc.
More instances of SQLite are used every day, by more people, than all other database engines combined. An yet, SQLite does not get much attention. Many developers hardly know anything about it. This session will review the features of SQLite, how it is different from other database engines, its strengths and its weaknesses, and when SQLite is an appropriate technology and when some other database engine might be a better choice.
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...DataStax
The 3.0 storage engine re-write is the biggest and most exciting change to ever happen in Apache Cassandra. The new storage engine can efficiently store and read data from disk using the same concepts present in the CQL 3 language. This has delivered large space savings, and creates new performance characteristics.
In this talk Aaron Morton, Co Founder at The Last Pickle and Apache Cassandra Committer, will discuss the 3.0 storage engine, it's layout and performance characteristics.
About the Speaker
Aaron Morton CEO, The Last Pickle
Aaron Morton is the Co Founder & CEO at The Last Pickle (thelastpickle.com). A professional services company that works with clients to deliver and improve Apache Cassandra based solutions. He's based in New Zealand, is an Apache Cassandra Committer and a DataStax MVP for Apache Cassandra.
This tutorial will guide you through the many considerations when deploying a sharded cluster. We will cover the services that make up a sharded cluster, configuration recommendations for these services, shard key selection, use cases, and how data is managed within a sharded cluster. Maintaining a sharded cluster also has its challenges. We will review these challenges and how you can prevent them with proper design or ways to resolve them if they exist today. There will be lab sessions at the end of some chapters so please have your laptops with you.
When we talk about bucketing we essentially talk about possibilities to split cassandra partitions in several smaller parts, rather than having only one large partition.
Bucketing of cassandra partitions can be crucial for optimizing queries, preventing large partitions or to fight TombstoneOverwhelmingException which can occur when creating too many tombstones.
In this talk I want to show how to recognize large partitions during datamodeling. I will also show different strategies we used in our projects to create, use and maintain buckets for our partitions.
About the Speaker
Markus Hofer IT Consultant, codecentric AG
Markus Hofer works as an IT Consultant for codecentric AG in Minster, Germany. He works on microservice architectures backed by DSE and/or Apache Cassandra. Markus supports and trains customers building cassandra based applications.
The slides we used at the first meetup hosted at Redis Labs' TLV offices :)
Touches on some of the more notable user-facing functionality in the newest Redis version, as well as interesting internal optimizations with major gains.
#RedisTLV: www.meetup.com/Tel-Aviv-Redis-Meetup/events/227594422/
Building a near real time search engine & analytics for logs using solrlucenerevolution
Presented by Rahul Jain, System Analyst (Software Engineer), IVY Comptech Pvt Ltd
Consolidation and Indexing of logs to search them in real time poses an array of challenges when you have hundreds of servers producing terabytes of logs every day. Since the log events mostly have a small size of around 200 bytes to few KBs, makes it more difficult to handle because lesser the size of a log event, more the number of documents to index. In this session, we will discuss the challenges faced by us and solutions developed to overcome them. The list of items that will be covered in the talk are as follows.
Methods to collect logs in real time.
How Lucene was tuned to achieve an indexing rate of 1 GB in 46 seconds
Tips and techniques incorporated/used to manage distributed index generation and search on multiple shards
How choosing a layer based partition strategy helped us to bring down the search response times.
Log analysis and generation of analytics using Solr.
Design and architecture used to build the search platform.
This presentation will discuss implementing external authentication when using Percona Server for MongoDB and MongoDB Enterprise. It will review authentication using OpenLDAP or ActiveDirectory and ActiveDirectory with Kerberos.
The presentation will also include examples of the configurations required by these external directory services. It will also review the LDAP Authorization features introduced in MongoDB Enterprise 3.4.
[db tech showcase Tokyo 2017] A11: SQLite - The most used yet least appreciat...Insight Technology, Inc.
More instances of SQLite are used every day, by more people, than all other database engines combined. An yet, SQLite does not get much attention. Many developers hardly know anything about it. This session will review the features of SQLite, how it is different from other database engines, its strengths and its weaknesses, and when SQLite is an appropriate technology and when some other database engine might be a better choice.
CQL performance with Apache Cassandra 3.0 (Aaron Morton, The Last Pickle) | C...DataStax
The 3.0 storage engine re-write is the biggest and most exciting change to ever happen in Apache Cassandra. The new storage engine can efficiently store and read data from disk using the same concepts present in the CQL 3 language. This has delivered large space savings, and creates new performance characteristics.
In this talk Aaron Morton, Co Founder at The Last Pickle and Apache Cassandra Committer, will discuss the 3.0 storage engine, it's layout and performance characteristics.
About the Speaker
Aaron Morton CEO, The Last Pickle
Aaron Morton is the Co Founder & CEO at The Last Pickle (thelastpickle.com). A professional services company that works with clients to deliver and improve Apache Cassandra based solutions. He's based in New Zealand, is an Apache Cassandra Committer and a DataStax MVP for Apache Cassandra.
This tutorial will guide you through the many considerations when deploying a sharded cluster. We will cover the services that make up a sharded cluster, configuration recommendations for these services, shard key selection, use cases, and how data is managed within a sharded cluster. Maintaining a sharded cluster also has its challenges. We will review these challenges and how you can prevent them with proper design or ways to resolve them if they exist today. There will be lab sessions at the end of some chapters so please have your laptops with you.
When we talk about bucketing we essentially talk about possibilities to split cassandra partitions in several smaller parts, rather than having only one large partition.
Bucketing of cassandra partitions can be crucial for optimizing queries, preventing large partitions or to fight TombstoneOverwhelmingException which can occur when creating too many tombstones.
In this talk I want to show how to recognize large partitions during datamodeling. I will also show different strategies we used in our projects to create, use and maintain buckets for our partitions.
About the Speaker
Markus Hofer IT Consultant, codecentric AG
Markus Hofer works as an IT Consultant for codecentric AG in Minster, Germany. He works on microservice architectures backed by DSE and/or Apache Cassandra. Markus supports and trains customers building cassandra based applications.
The slides we used at the first meetup hosted at Redis Labs' TLV offices :)
Touches on some of the more notable user-facing functionality in the newest Redis version, as well as interesting internal optimizations with major gains.
#RedisTLV: www.meetup.com/Tel-Aviv-Redis-Meetup/events/227594422/
Фреймворк Akka и его использование в ЯндексеVadim Tsesko
Доклад с JPoint 2014 (http://javapoint.ru).
Краткое содержание:
* Actor Model на примере Akka
* Происхождение
* Концепции и API
* Примеры кода
* Примеры систем в Яндекс
* Конвейерная обработка данных
* Реактивные иерархические системы
* Опыт разработки и эксплуатации
* Подводные камни
* Проблемы и некоторые решения
* Дополнительные тулы
В рамках магистерского курса "Параллельные вычисления" на кафедре КСПТ прочитал лекцию "Actor Model":
* Actor Model
* Futures and Promises
* Примеры систем
SFBigAnalytics_20190724: Monitor kafka like a ProChester Chen
Kafka operators need to provide guarantees to the business that Kafka is working properly and delivering data in real time, and they need to identify and triage problems so they can solve them before end users notice them. This elevates the importance of Kafka monitoring from a nice-to-have to an operational necessity. In this talk, Kafka operations experts Xavier Léauté and Gwen Shapira share their best practices for monitoring Kafka and the streams of events flowing through it. How to detect duplicates, catch buggy clients, and triage performance issues – in short, how to keep the business’s central nervous system healthy and humming along, all like a Kafka pro.
Speakers: Gwen Shapira, Xavier Leaute (Confluence)
Gwen is a software engineer at Confluent working on core Apache Kafka. She has 15 years of experience working with code and customers to build scalable data architectures. She currently specializes in building real-time reliable data processing pipelines using Apache Kafka. Gwen is an author of “Kafka - the Definitive Guide”, "Hadoop Application Architectures", and a frequent presenter at industry conferences. Gwen is also a committer on the Apache Kafka and Apache Sqoop projects.
Xavier Leaute is One of the first engineers to Confluent team, Xavier is responsible for analytics infrastructure, including real-time analytics in KafkaStreams. He was previously a quantitative researcher at BlackRock. Prior to that, he held various research and analytics roles at Barclays Global Investors and MSCI.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
VMworld 2013: Deep Dive into vSphere Log Management with vCenter Log InsightVMworld
VMworld 2013
Steve Flanders, VMware
Chengdu Huang, VMware
Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare
GTIDs were introduced to solve replication problems and improve database consistency in MySQL database replication.
When, accidentally, transactions occur on a replica, this introduces GTIDs on that replica that don't exist on the master. When, on a master failover, this replica becomes the new master, and the corresponding binlogs of the errant GTIDs are already purged, replication breaks on the replicas of this new master, because those missing GTIDs can't be retrieved from the binlogs of this new master.
This presentation will talk about GTIDs and how to detect errant GTIDs on a replica (before the corresponding binlogs are purged) and how to look at the corresponding transactions in the binlogs. I'll give some examples of transactions that could happen on a replica that didn't originate from a primary node, explain how this is possible and share some tips on how to avoid this.
Basic understanding of MySQL database replication is assumed.
This presentation was at Percona Live 2019 in Austin, Texas.
https://www.percona.com/live/19/sessions/errant-gtids-breaking-replication-how-to-detect-and-avoid-them
This presentation recounts the story of Macys.com and Bloomingdales.com's migration from legacy RDBMS to NoSQL Cassandra in partnership with DataStax.
One thing that differentiates this talk from others on Cassandra is Macy's philosophy of "doing more with less." You will see why we emphasize the performance tuning aspects of iterative development when you see how much processing we can support on relatively small configurations.
This session will cover:
1) The process that led to our decision to use Cassandra
2) The approach we used for migrating from DB2 & Coherence to Cassandra without disrupting the production environment
3) The various schema options that we tried and how we settled on the current one. We'll show you a selection of some of our extensive performance tuning benchmarks, as well as how these performance results figured into our final schema designs.
4) Our lessons learned and next steps
Scaling ingest pipelines with high performance computing principles - Rajiv K...SignalFx
By Rajiv Kurian, software engineer at SignalFx.
At SignalFx, we deal with high-volume high-resolution data from our users. This requires a high performance ingest pipeline. Over time we’ve found that we needed to adapt architectural principles from specialized fields such as HPC to get beyond performance plateaus encountered with more generic approaches. Some key examples include:
* Write very simple single threaded code, instead of complex algorithms
* Parallelize by running multiple copies of simple single threaded code, instead of using concurrent algorithms
* Separate the data plane from the control plane, instead of slowing data for control
* Write compact, array-based data structures with minimal indirection, instead of pointer-based data structures and uncontrolled allocation
Is SQLcl the Next Generation of SQL*Plus?Zohar Elkayam
Session from ILOUG I presented in May, 2016
Introducing the new tool from the developers of SQL Developer: SQLcl – a new command line tool from the SQL Developer team that might replace SQL*Plus and all of its functions which has been around for over 30 years!
In this session, we will explore the new functionality of the SQLcl, and use a live demonstration to show what SQLcl has to offer over the old SQL*Plus. We will use real life example to see what makes this tool such a time saver in day-to-day tasks for DBAs and developers who prefer using the command line interface.
JavaOne2016 - Microservices: Terabytes in Microseconds [CON4516]Malin Weiss
By leveraging memory-mapped files, Speedment and the Chronicle Engine supports large Java maps that easily can exceed the size of your server’s RAM.Because the Java maps are mapped onto files, these maps can be shared instantly between several microservice JVMs and new microservice instances can be added, removed, or restarted very quickly. Data can be retrieved with predictable ultralow latency for a wide range of operations. The solution can be synchronized with an underlying database so that your in-memory maps will be consistently “alive.” The mapped files can be tens of terabytes, which has been done in real-world deployment cases, and a large number of micro services can share these maps simultaneously. Learn more in this session.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Elevating Tactical DDD Patterns Through Object CalisthenicsDorra BARTAGUIZ
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
9. Disadvantages
•
•
•
•
No secondary indexes from the box
•
Available through additional table and
coprocessors
No transactions
•
CAS support, row locks, counters
Not good for multiple data centers
•
Master-slave replication
No complex query processing
9
24. Client connection
•
•
•
•
Ask ZK to get address -ROOT- region
From -ROOT- gets the RS with .META.
From .META. gets the address of RS
Cache all these data
24
25. Write path
•
•
•
Write to HLog (one per RS)
Write to Memstore (one per CF)
If Memstore is full, flush is requested
•
This request is processed async by
another thread
25
26. On Memstore flush
•
•
On flush Memstore is written to HDFS file
•
Memstore is cleared
The last operation sequence number is
stored
26
27. Region splits
•
•
Region is closed
•
.META. is updated (parent and daughters)
Creates directory structure for new
regions (multithreaded)
27
28. Region splits
•
•
•
Master is notified for load balancing
All steps are tracked in ZK
Parent cleaned up eventually
28
29. Compaction
•
When Memstore is flushed new store file
is created
•
Compaction is a housekeeping mechanism
which merges store files
•
•
Come in two varieties: minor and major
The type is determined when the
compaction check is executed
29
32. Major compaction
•
•
Compact all files into a single file
•
Starts periodically (each 24 hours by
default)
•
Checks tombstone markers and removes
data
Might be promoted from the minor
compaction
32
34. HFile format
•
•
•
•
•
Immutable (trailer is written once)
Trailer has pointers to other blocks
Indexes store offsets to data, metadata
Block size by default 64K
Block contains magic and number of key
values
34
39. Read path
•
•
There is no single index with logical rows
•
Gets are the same as scans
Data is stored in several files and
Memstore per column family
39
45. CRUD
•
•
•
•
•
All operations do through HTable object
All operations per-row atomic
Row lock is available
Batch is available
Looks like this:
HTable table = new HTable("table");
table.put(new Put(...));
45
47. Put
•
Put add(byte[] family, byte[] qualifier, byte[]
value)
•
Put add(byte[] family, byte[] qualifier, long
ts, byte[] value)
•
Put add(KeyValue kv)
47
49. Write buffer
•
•
•
•
There is a client-side write buffer
Sorts data by Region Servers
2Mb by default
Controlled by:
table.setAutoFlush(false);
table.flushCommits();
49
52. Get
•
Returns strictly one row
Result res = table.get(new Get(row));
•
Can ask if row exists
boolean e = table.exist(new Get(row));
•
Filter can be applied
52
53. Row lock
•
•
Region server provide a row lock
Client can ask for explicit lock:
RowLock lockRow(byte[] row) throws IOException;
void unlockRow(RowLock rl) throws IOException;
•
Do not use if you do not have to
53
54. Scan
•
•
•
•
Iterator over a number of rows
Can be defined start and end rows
Leased for amount of time
Batching and caching is available
54
55. Scan
final Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("colfam1"));
scan.setBatch(5);
scan.setCaching(5);
scan.setMaxVersions(1);
scan.setTimeStamp(0);
scan.setStartRow(Bytes.toBytes(0));
scan.setStartRow(Bytes.toBytes(1000));
ResultScanner scanner = table.getScanner(scan);
for (Result res : scanner) {
...
}
scanner.close();
55
56. Scan
•
•
•
Caching is for rows
Batching is for columns
Result.next() will return next set of
batched cols (more than one for row)
56
58. Counters
•
•
•
•
Any column can be treated as counters
Atomic increment without locking
Can be applied for multiple rows
Looks like this:
long cnt = table.incrementColumnValue(Bytes.toBytes("20131028"),
Bytes.toBytes("daily"), Bytes.toBytes("hits"), 1);
58
59. HTable pooling
•
•
•
Creating an HTable instance is expensive
HTable is not thread safe
Should be created on start up and closed
on application exit
59
65. Email example
Approach #1: user id is raw key, all messages
in one cell
<userId> : <cf> : <col> : <ts>:<messages>
12345 : data : msgs : 1307097848 : ${msg1}, ${msg2}...
65
66. Email example
Approach #1: user id is raw key, all messages
in one cell
<userId> : <cf> : <colval> : <messages>
•
•
•
All in one request
Hard to find anything
Users with a lot of messages
66
67. Email example
Approach #2: user id is raw key, message id in
column family
<userId> : <cf> : <msgId> :
<timestamp> : <email-message>
12345 : data : 5fc38314-e290-ae5da5fc375d : 1307097848 : "Hi, ..."
12345 : data : 725aae5f-d72e-f90f3f070419 : 1307099848 : "Welcome, and ..."
12345 : data : cc6775b3-f249-c6dd2b1a7467 : 1307101848 : "To Whom It ..."
12345 : data : dcbee495-6d5e-6ed48124632c : 1307103848 : "Hi, how are ..."
67
68. Email example
Approach #2: user id is raw key, message id in
column family
<userId> : <cf> : <msgId> :
<timestamp> : <email-message>
Can load data on demand
•
•
•
Find anything only with filter
Users with a lot of messages
68
69. Email example
Approach #3: raw id is a <userId> <msgId>
<userId> - <msgId>: <cf> : <qualifier> :
<timestamp> : <email-message>
12345-5fc38314-e290-ae5da5fc375d : data : : 1307097848 : "Hi, ..."
12345-725aae5f-d72e-f90f3f070419 : data : : 1307099848 : "Welcome, and ..."
12345-cc6775b3-f249-c6dd2b1a7467 : data : : 1307101848 : "To Whom It ..."
12345-dcbee495-6d5e-6ed48124632c : data : : 1307103848 : "Hi, how are ..."
69
70. Email example
Approach #3: raw id is a <userId> <msgId>
<userId> - <msgId>: <cf> : <qualifier> :
<timestamp> : <email-message>
Can load data on demand
•
•
•
Find message with row key
Users with a lot of messages
70