HBase in Action - Chapter 09: Deploying HBase
3.
9.1 Planning your cluster
Planning an HBase cluster includes planning the underlying Hadoop cluster.
This section will highlight the considerations to keep in mind when choosing hardware and how the roles (HBase Master, RegionServers, ZooKeeper, and so on) should be deployed on the cluster.
Prototype cluster
A prototype cluster is one that doesn’t have strict SLAs, and it’s okay for it to go down.
Collocate the HBase Master with the Hadoop NameNode and JobTracker on the same node.
It typically has fewer than 10 nodes.
It’s okay to collocate multiple services on a single node in a prototype cluster.
4–6 cores, 24–32 GB RAM, and 4 disks per node should be a good place to start.
4.
9.1 Planning your cluster (cont'd)
Small production cluster (10–20 servers): Generally, you shouldn’t have fewer than 10 nodes in a production HBase cluster.
A cluster with fewer than 10 slave nodes is hard to operationalize.
Consider relatively better hardware for the Master nodes if you’re deploying a production cluster. Dual power supplies and perhaps RAID are the order of the day.
Small production clusters without much traffic/workload can have services collocated.
A single HBase Master is okay for small clusters.
A single ZooKeeper instance is okay for small clusters and can be collocated with the HBase Master. If the host running the NameNode and JobTracker is beefy enough, put ZooKeeper and the HBase Master on it too. This saves you having to buy an extra machine.
Note that a single HBase Master and a single ZooKeeper instance limit serviceability.
6.
9.1 Planning your cluster (cont'd)
Medium production cluster (up to ~50 servers)
Up to 50 nodes, possibly in production, would fall in this category.
We recommend that you not collocate HBase and MapReduce, for performance reasons. If you do collocate them, deploy the NameNode and JobTracker on separate hardware.
Three ZooKeeper and three HBase Master nodes should be deployed, especially if this is a production system. You don’t need three HBase Masters and can do with two; but given that you already have three ZooKeeper nodes and are sharing ZooKeeper and HBase Master hosts, it doesn’t hurt to have a third Master.
Don’t cheap out on the hardware for the NameNode and Secondary NameNode.
7.
9.1 Planning your cluster (cont'd)
Large production cluster (>~50 servers)
Everything for the medium-sized cluster holds true, except that you may need five ZooKeeper instances, which can also be collocated with the HBase Masters.
Make sure the NameNode and Secondary NameNode have enough memory, depending on the storage capacity of the cluster.
Hadoop Master nodes
Have redundancy at the hardware level for the various components: NICs, RAID disks.
Make sure the NameNode has enough RAM to hold the entire HDFS namespace in memory.
The Secondary NameNode should have the same hardware as the NameNode.
8.
9.1 Planning your cluster (cont'd)
HBase Master
The HBase Master is a lightweight process and doesn’t need a lot of resources, but it’s wise to keep it on independent hardware if possible.
Have multiple HBase Masters for redundancy.
4 cores, 8–16 GB RAM, and 2 disks are more than enough for the HBase Master nodes.
Hadoop DataNodes and HBase RegionServers
DataNodes and RegionServers are always collocated; they serve the traffic. Avoid running MapReduce on the same nodes.
8–12 cores, 24–32 GB RAM, and 12x 1 TB disks are a good place to start.
You can increase the number of disks for higher storage density, but don’t go too high, or replication will take time in the face of node or disk failure.
Get a larger number of reasonably sized boxes instead of fewer beefy ones.
9.
9.1 Planning your cluster (cont'd)
ZooKeeper(s)
ZooKeeper is lightweight but latency sensitive.
Hardware similar to that of the HBase Master works fine if you’re looking to deploy ZooKeeper separately.
The HBase Master and ZooKeeper can be collocated safely as long as you make sure ZooKeeper gets a dedicated spindle for its data persistence.
Add a disk (for the ZooKeeper data to be persisted on) to the configuration mentioned in the HBase Master section if you’re collocating.
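Pointing ZooKeeper at that dedicated spindle is a one-line change in zoo.cfg. A minimal sketch, assuming the extra disk is mounted at /zookeeper (a hypothetical mount point, not from the slides):
# zoo.cfg: keep ZooKeeper's snapshots and (by default) its transaction log on the dedicated disk
dataDir=/zookeeper/data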
10.
9.1 Planning your cluster (cont'd)
What about the cloud?
At least 16 GB RAM: HBase RegionServers are RAM hungry. But don’t give them too much, or you’ll run into Java GC issues. We’ll talk about tuning GC later in this chapter.
Have as many disks as possible. Most EC2 instances at the time of writing don’t provide a high number of disks.
A fatter network is always better.
Get ample compute based on your individual use case. MapReduce jobs need more compute power than a simple website-serving database.
It’s important that you’re aware of the arguments in favor of and against deploying HBase in the cloud:
Cost
Ease of use
Operations
Reliability
Lack of customization
Performance
Security
11.
9.2 Deploying software
Managing and deploying software on a cluster of machines, especially in production, is nontrivial and needs careful work.
When deploying to a large number of machines, we recommend that you automate the process as much as possible.
Our intent is to introduce you to all the ways you can think about deployments.
Whirr: deploying in the cloud: If you’re looking to deploy HBase in the cloud, you should get Apache Whirr to make your life easier.
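To give a flavor of what that looks like, here is a hedged sketch of a Whirr recipe and launch commands for a small HBase cluster on EC2. The file name, cluster name, and instance counts are illustrative assumptions; the property and role names follow Whirr's conventions:
# hbase-ec2.properties (hypothetical recipe file)
whirr.cluster-name=hbase-test
whirr.instance-templates=1 zookeeper+hadoop-namenode+hadoop-jobtracker+hbase-master,5 hadoop-datanode+hadoop-tasktracker+hbase-regionserver
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}

$ whirr launch-cluster --config hbase-ec2.properties
$ whirr destroy-cluster --config hbase-ec2.properties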
12.
9.3 Distributions
This section will cover installing HBase on your cluster. Numerous distributions (or packages) of HBase are available, and each has multiple releases. The most notable distributions currently are the stock Apache distribution and Cloudera’s CDH:
Apache: The Apache HBase project is the parent project, where all the development for HBase happens.
Cloudera’s CDH: Cloudera is a company that has its own distribution containing Hadoop and other components in the ecosystem, including HBase.
We recommend using Cloudera’s CDH distribution. It typically includes more patches than the stock releases to add stability, performance improvements, and sometimes features.
CDH is also better tested than the Apache releases and is running in production in more clusters than stock Apache. These are points we recommend thinking about before you choose the distribution for your cluster.
13.
9.3.1 Using the stock Apache distribution
To install the stock Apache distribution, you need to download the tarballs and install them into a directory of your choice.
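A minimal sketch for one node, assuming the 0.92.1 release (contemporary with this chapter) and /usr/local as the install target; the URL, version, and path are illustrative assumptions:
$ wget http://archive.apache.org/dist/hbase/hbase-0.92.1/hbase-0.92.1.tar.gz
$ tar xzf hbase-0.92.1.tar.gz -C /usr/local
$ export HBASE_HOME=/usr/local/hbase-0.92.1   # configuration lives in $HBASE_HOME/conf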
14.
9.3.2 Using Cloudera’s CDH distribution
The current release of CDH is CDH4u0, which is based on the 0.92.1 Apache release. The installation instructions are environment specific; the fundamental steps are sketched below.
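On an RPM-based system, for example, the steps amount to adding Cloudera's package repository and then installing, on each node, the packages for the daemons that node runs. The package names below follow CDH's naming but are given as an illustrative assumption, not taken from the slides:
$ sudo yum install hbase                 # libraries, scripts, and configs (all nodes)
$ sudo yum install hbase-master          # on the Master nodes
$ sudo yum install hbase-regionserver    # on the slave nodes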
15.
9.4 Configuration
Deploying HBase requires configuring Linux, Hadoop, and, of course, HBase.
In order to configure the system in the most optimal manner, it’s important that you understand the parameters and the implications of tuning them one way or another.
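On the HBase side, the site-level parameters live in hbase-site.xml. As a hedged sketch of a minimal fully distributed setup (the host names are placeholders, not values from the slides):
<configuration>
  <!-- where HBase persists its data on HDFS -->
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://namenode.example.com:8020/hbase</value>
  </property>
  <!-- run fully distributed rather than standalone -->
  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
  </property>
  <!-- the ZooKeeper ensemble HBase coordinates through -->
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk1.example.com,zk2.example.com,zk3.example.com</value>
  </property>
</configuration>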
16.
9.4.1 HBase configurations
ENVIRONMENT CONFIGURATIONS: hbase-env.sh is where things like the Java heap size, garbage-collection parameters, and other environment variables are set.
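For instance, a hedged sketch of the two settings the editor's notes call out, memory allocation and GC (the values are illustrative, not prescriptive; HBASE_HEAPSIZE is expressed in MB):
# hbase-env.sh
export HBASE_HEAPSIZE=8000
export HBASE_OPTS="-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"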
20.
9.4.3 Operating system configurations
HBase is a database and needs to keep files open so you can read from and write to them without incurring the overhead of opening and closing them on each operation.
To increase the open-file limit, put the following statements in your /etc/security/limits.conf file for the users that will run the Hadoop and HBase daemons. CDH does this for you as a part of the package installation (the '-' type applies the limit as both soft and hard):
hadoopuser  -     nofile  32768
hbaseuser   -     nofile  32768
hadoopuser  soft  nproc   32000
hadoopuser  hard  nproc   32000
hbaseuser   soft  nproc   32000
hbaseuser   hard  nproc   32000
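After the daemon user logs in again, you can confirm the new limit took effect; the output shown is what you'd expect, not a captured log:
$ ulimit -n
32768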
Another important configuration parameter to tune is the swap behavior.
$ sysctl -w vm.swappiness=0
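Note that sysctl -w only changes the running kernel. To make the setting survive a reboot, persist it in /etc/sysctl.conf as well:
# /etc/sysctl.conf
vm.swappiness = 0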
21.
9.5 Managing the daemons
The relevant services need to be started on each node of the cluster:
Use the bundled start and stop scripts (see the sketch after this list).
Cluster SSH (http://sourceforge.net/projects/clusterssh) is a useful tool if you’re dealing with a cluster of machines. It allows you to simultaneously run the same shell commands on a cluster of machines that you’re logged in to in separate windows.
Homegrown scripts are always an option.
Use management software like Cloudera Manager, which allows you to manage all the services on the cluster from a single web-based UI.
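For the bundled scripts on a stock Apache install, the typical commands look like this. start-hbase.sh uses SSH to start the daemons across the cluster from the Master, while hbase-daemon.sh manages a single daemon on the local node:
$ $HBASE_HOME/bin/start-hbase.sh                       # whole cluster, run from the Master
$ $HBASE_HOME/bin/hbase-daemon.sh start regionserver   # one daemon, on this node
$ $HBASE_HOME/bin/hbase-daemon.sh stop regionserver
$ $HBASE_HOME/bin/stop-hbase.sh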
23.
9.6 Summary
In this chapter, we covered the various aspects of deploying HBase in a fully distributed environment for your production application.
We talked about the considerations to take into account when choosing hardware for your cluster, including whether to deploy on your own hardware or in the cloud.
This chapter gets you ready to think about putting HBase in production.
Editor's Notes
This assumes you aren’t collocating MapReduce with HBase, which is the recommended way of running HBase if you’re using it for low-latency access. Collocating the two would require more cores, RAM, and spindles.
Two of the important things configured here are the memory allocation and GC. It’s critical to pay attention to these if you want to extract decent performance from your HBase deployment. HBase is a database and needs lots of memory to provide low-latency reads and writes.
We don’t recommend that you give the RegionServers more than 15 GB of heap in a production HBase deployment. The reason for not going over the top and allocating larger heaps than that is that GC starts to become too expensive.
-Xmx8g -Xms8g -Xmn128m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70
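These flags would typically be applied to the RegionServer processes via hbase-env.sh. A hedged sketch using the values above (HBASE_REGIONSERVER_OPTS is the stock hbase-env.sh variable for RegionServer-specific JVM options; the sizes themselves come from the slide, not a recommendation for every cluster):
# hbase-env.sh: apply the GC settings to RegionServers only
export HBASE_REGIONSERVER_OPTS="-Xmx8g -Xms8g -Xmn128m -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70"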