Apache HBase is a technology that turns the Hadoop infrastructure upside down. An elephant cannot become an antelope, yet a group dance on its back is still possible.
Following the success of the "HBase, dances on the elephant back" presentation, I have prepared an updated version for JavaDay 2014 Kyiv. Again, it is about the product that radically changes everything inside the Hadoop infrastructure: Apache HBase. This time the focus shifts toward integration and more advanced topics, while keeping the presentation accessible to technology newcomers.
Jesse Anderson (Smoking Hand)
This early-morning session offers an overview of what HBase is, how it works, its API, and considerations for using HBase as part of a Big Data solution. It will be helpful for people who are new to HBase, and also serve as a refresher for those who may need one.
HBase is a distributed, column-oriented database that runs on top of Hadoop and HDFS, providing Bigtable-like capabilities for massive tables of structured and unstructured data. It is modeled after Google's Bigtable and provides a distributed, scalable, versioned storage system with strong consistency for random read/write access to billions of rows and millions of columns. HBase is well-suited for handling large datasets and providing real-time read/write access across clusters of commodity servers.
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I'll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
Hw09 Practical HBase: Getting the Most from Your HBase Install (Cloudera, Inc.)
The document summarizes two presentations about using HBase as a database. It discusses the speakers' experiences using HBase at Stumbleupon and Streamy to replace MySQL and other relational databases. Some key points covered include how HBase provides scalability, flexibility, and cost benefits over SQL databases for large datasets.
In this session you will learn:
1. Meet MapReduce
2. Word Count Algorithm – Traditional approach
3. Traditional approach on a Distributed System
4. Traditional approach – Drawbacks
5. MapReduce Approach
6. Input & Output Forms of a MR program
7. Map, Shuffle & Sort, Reduce Phase
8. WordCount Code walkthrough (a minimal code sketch follows this list)
9. Workflow & Transformation of Data
10. Input Split & HDFS Block
11. Relation between Split & Block
12. Data locality Optimization
13. Speculative Execution
14. MR Flow with Single Reduce Task
15. MR flow with multiple Reducers
16. Input Format & Hierarchy
17. Output Format & Hierarchy
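As a companion to the outline above, here is a minimal WordCount sketch in the classic Hadoop MapReduce Java API; the class and job names are illustrative and not taken from the session materials.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token in the input split.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: all counts for the same word arrive together after shuffle & sort.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on the map side
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```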
Chicago Data Summit: Apache HBase: An Introduction (Cloudera, Inc.)
Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
In this session you will learn:
1. History of hadoop
2. Hadoop Ecosystem
3. Hadoop Animal Planet
4. What is Hadoop?
5. Distinctions of hadoop
6. Hadoop Components
7. The Hadoop Distributed Filesystem
8. Design of HDFS
9. When Not to use Hadoop?
10. HDFS Concepts
11. Anatomy of a File Read
12. Anatomy of a File Write
13. Replication & Rack awareness
14. Mapreduce Components
15. Typical Mapreduce Job
Based on "HBase, dances on the elephant back" presentation success I have prepared its update for JavaDay 2014 Kyiv. Again, it is about the product which revolutionary changes everything inside Hadoop infrastructure: Apache HBase. But here focus is shifted to integration and more advanced topics keeping presentation yet understandable for technology newcomers.
Jesse Anderson (Smoking Hand)
This early-morning session offers an overview of what HBase is, how it works, its API, and considerations for using HBase as part of a Big Data solution. It will be helpful for people who are new to HBase, and also serve as a refresher for those who may need one.
HBase is a distributed, column-oriented database that runs on top of Hadoop and HDFS, providing Bigtable-like capabilities for massive tables of structured and unstructured data. It is modeled after Google's Bigtable and provides a distributed, scalable, versioned storage system with strong consistency for random read/write access to billions of rows and millions of columns. HBase is well-suited for handling large datasets and providing real-time read/write access across clusters of commodity servers.
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I'll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
Hw09 Practical HBase Getting The Most From Your H Base InstallCloudera, Inc.
The document summarizes two presentations about using HBase as a database. It discusses the speakers' experiences using HBase at Stumbleupon and Streamy to replace MySQL and other relational databases. Some key points covered include how HBase provides scalability, flexibility, and cost benefits over SQL databases for large datasets.
In this session you will learn:
1. Meet MapReduce
2. Word Count Algorithm – Traditional approach
3. Traditional approach on a Distributed System
4. Traditional approach – Drawbacks
5. MapReduce Approach
6. Input & Output Forms of a MR program
7. Map, Shuffle & Sort, Reduce Phase
8. WordCount Code walkthrough
9. Workflow & Transformation of Data
10. Input Split & HDFS Block
11. Relation between Split & Block
12. Data locality Optimization
13. Speculative Execution
14. MR Flow with Single Reduce Task
15. MR flow with multiple Reducers
16. Input Format & Hierarchy
17. Output Format & Hierarchy
Chicago Data Summit: Apache HBase: An IntroductionCloudera, Inc.
Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
In this session you will learn:
1. History of hadoop
2. Hadoop Ecosystem
3. Hadoop Animal Planet
4. What is Hadoop?
5. Distinctions of hadoop
6. Hadoop Components
7. The Hadoop Distributed Filesystem
8. Design of HDFS
9. When Not to use Hadoop?
10. HDFS Concepts
11. Anatomy of a File Read
12. Anatomy of a File Write
13. Replication & Rack awareness
14. Mapreduce Components
15. Typical Mapreduce Job
Sept 17 2013 - THUG - HBase: A Technical Introduction (Adam Muise)
HBase Technical Introduction. This deck includes a description of memory design, write path, read path, some operational tidbits, SQL on HBase (Phoenix and Hive), as well as HOYA (HBase on YARN).
Facebook - Jonathan Gray - Hadoop World 2010 (Cloudera, Inc.)
The document summarizes HBase use at Facebook, including its development and future work. HBase is used for incremental updates to data warehouses, high frequency analytics, and write-intensive workloads. Development includes Hive integration, master high availability, and random read optimizations. Future work focuses on coprocessors, intelligent load balancing, and cluster performance.
This document provides an overview of Apache Hadoop and HBase. It begins with an introduction to why big data is important and how Hadoop addresses storing and processing large amounts of data across commodity servers. The core components of Hadoop, HDFS for storage and MapReduce for distributed processing, are described. An example MapReduce job is outlined. The document then introduces the Hadoop ecosystem, including Apache HBase for random read/write access to data stored in Hadoop. Real-world use cases of Hadoop at companies like Yahoo, Facebook and Twitter are briefly mentioned before addressing questions.
Using Hadoop and Hive to Optimize Travel Search, WindyCityDB 2010 (Jonathan Seidman)
Using Hadoop and Hive, Orbitz analyzed large amounts of web analytics data to optimize travel search and gain insights. They loaded over 500GB of daily log data into Hadoop and used Hive to run SQL-like queries to derive metrics like the position of booked hotels in search results and booking position trends by location. Statistical analysis in R helped explore trends, correlations and outliers in the Hive datasets to help machine learning applications.
Introduction to Apache HBase, MapR Tables and Security (MapR Technologies)
This talk will focus on two key aspects of applications that use the HBase APIs. The first part will provide a basic overview of how HBase works, followed by an introduction to the HBase APIs with a simple example. The second part will extend what we've learned to secure the HBase application running on MapR's industry-leading Hadoop.
Keys Botzum is a Senior Principal Technologist with MapR Technologies. He has over 15 years of experience in large scale distributed system design. At MapR his primary responsibility is working with customers as a consultant, but he also teaches classes, contributes to documentation, and works with MapR engineering. Previously he was a Senior Technical Staff Member with IBM and a respected author of many articles on WebSphere Application Server as well as a book. He holds a Masters degree in Computer Science from Stanford University and a B.S. in Applied Mathematics/Computer Science from Carnegie Mellon University.
Speaker: Jesse Anderson (Cloudera)
As optional pre-conference prep for attendees who are new to HBase, this session will offer a brief, Cliff's Notes-level overview of architecture, API, and schema design. The architecture section will cover the daemons and their functions; the API section will cover HBase's GET, PUT, and SCAN classes; and the schema design section will cover how HBase differs from an RDBMS and how much effort to place on schema and row-key design.
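To make those classes concrete, here is a minimal sketch using the standard HBase Java client (an HBase 1.x-style API is assumed; the table, family, and column names are invented for illustration):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseApiBasics {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("items"))) {

      // PUT: write one row with two columns in the "attrs" family.
      Put put = new Put(Bytes.toBytes("book#1"));
      put.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("title"), Bytes.toBytes("HBase basics"));
      put.addColumn(Bytes.toBytes("attrs"), Bytes.toBytes("price"), Bytes.toBytes("25"));
      table.put(put);

      // GET: read the row back by its key.
      Result row = table.get(new Get(Bytes.toBytes("book#1")));
      String title = Bytes.toString(row.getValue(Bytes.toBytes("attrs"), Bytes.toBytes("title")));
      System.out.println("title = " + title);

      // SCAN: iterate over a contiguous row-key range
      // (newer clients use withStartRow/withStopRow instead).
      Scan scan = new Scan();
      scan.setStartRow(Bytes.toBytes("book#"));
      scan.setStopRow(Bytes.toBytes("book#~"));
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }
    }
  }
}
```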
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 (larsgeorge)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real-world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
Scaling HDFS to Manage Billions of Files with Key-Value Stores (DataWorks Summit)
The document discusses scaling HDFS to manage billions of files. It describes how HDFS usage has grown from millions of files in 2007 to potentially billions of files in the future. To address this, the speakers propose storing HDFS metadata in a key-value store like LevelDB instead of solely in memory. They evaluate this approach and find comparable performance to HDFS for most operations. Future work includes improving operations like compaction and failure recovery in the new architecture.
The document provides an introduction to NoSQL and HBase. It discusses what NoSQL is and the different types of NoSQL databases, and compares NoSQL to SQL databases. It then focuses on HBase, describing its architecture and components such as the HMaster, RegionServers, and ZooKeeper. It explains how HBase stores and retrieves data, including the write path with memstores and compaction. It also covers HBase shell commands for creating, inserting, querying, and deleting data.
HBase is an open-source, distributed, versioned, key-value database modeled after Google's Bigtable. It is designed to store large volumes of sparse data across commodity hardware. HBase uses Hadoop for storage and provides real-time read and write capabilities. It scales horizontally and is highly fault tolerant through its master-slave architecture and use of Zookeeper for coordination. Data in HBase is stored in tables and indexed by row keys for fast lookup, with columns grouped into families and versions stored by timestamps.
HBase is an open-source, non-relational, distributed database built on top of Hadoop and HDFS. It provides BigTable-like capabilities for Hadoop, including fast random reads and writes. HBase stores data in tables comprised of rows, columns, and versions. It is designed to handle large volumes of sparse or unstructured data across clusters of commodity hardware. HBase uses a master-slave architecture with RegionServers storing and serving data and a single active MasterServer managing the cluster metadata and load balancing.
Hadoop World 2011: Advanced HBase Schema Design (Cloudera, Inc.)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second.
This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they can be implemented on top of HBase, using schemas that optimize for the given access patterns.
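As a hedged illustration of that idea (my own sketch, not material from the talk): a common pattern is to encode the dominant access path into a composite row key, for example an entity id prefix plus a reverse timestamp, so that the newest events for an entity sort first and stay contiguous for scans. All names below are hypothetical.

```java
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative composite row key: <userId> | <Long.MAX_VALUE - timestamp>.
// All events of one user stay contiguous (cheap scans) and the newest sort first.
public final class EventRowKey {

  private EventRowKey() {}

  public static byte[] of(String userId, long eventTimeMillis) {
    long reversedTs = Long.MAX_VALUE - eventTimeMillis; // reverse timestamp: newest first
    return Bytes.add(
        Bytes.toBytes(userId),
        Bytes.toBytes("|"),
        Bytes.toBytes(reversedTs));
  }
}
```

If the key started with the timestamp instead, all writes would hammer a single region; a common countermeasure is to prepend a small hash-based salt so writes spread across regions, at the cost of fanning reads out over the salt buckets.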
Big Data Fundamentals in the Emerging New Data World (Jongwook Woo)
I talk about the fundamentals of Big Data, including Hadoop, data-intensive computing, and the NoSQL databases that have gained prominence for computing and storing Big Data, which is usually larger than a petabyte. I also introduce case studies that use Hadoop and NoSQL databases.
This presentation about HBase will help you understand what HBase is, what its applications are, how it differs from an RDBMS, what HBase storage is, and what the architectural components of HBase are; at the end, we will also look at some HBase commands in a demo. HBase is an essential part of the Hadoop ecosystem. It is a column-oriented database management system derived from Google’s NoSQL database Bigtable that runs on top of HDFS. After watching this video, you will know how to store and process large datasets using HBase. Now, let us get started and understand HBase and what it is used for.
Below topics are explained in this HBase presentation:
1. What is HBase?
2. HBase Use Case
3. Applications of HBase
4. HBase vs RDBMS
5. HBase Storage
6. HBase Architectural Components
What is this Big Data Hadoop training course about?
Simplilearn’s Big Data Hadoop training course lets you master the concepts of the Hadoop framework and prepares you for Cloudera’s CCA175 Big Data certification. The Big Data Hadoop and Spark developer course has been designed to impart in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
This course will enable you to:
1. Understand the different components of the Hadoop ecosystem such as Hadoop 2.7, Yarn, MapReduce, Pig, Hive, Impala, HBase, Sqoop, Flume, and Apache Spark
2. Understand Hadoop Distributed File System (HDFS) and YARN as well as their architecture, and learn how to work with them for storage and resource management
3. Understand MapReduce and its characteristics, and assimilate some advanced MapReduce concepts
4. Get an overview of Sqoop and Flume and describe how to ingest data using them
5. Create database and tables in Hive and Impala, understand HBase, and use Hive and Impala for partitioning
6. Understand different types of file formats, Avro Schema, using Avro with Hive, and Sqoop and Schema evolution
7. Understand Flume, Flume architecture, sources, flume sinks, channels, and flume configurations
8. Understand HBase, its architecture, data storage, and working with HBase. You will also understand the difference between HBase and RDBMS
9. Gain a working knowledge of Pig and its components
10. Do functional programming in Spark
11. Understand resilient distributed datasets (RDD) in detail
12. Implement and build Spark applications
13. Gain an in-depth understanding of parallel processing in Spark and Spark RDD optimization techniques
14. Understand the common use-cases of Spark and the various interactive algorithms
15. Learn Spark SQL, creating, transforming, and querying Data frames
Learn more at https://www.simplilearn.com/big-data-and-analytics/big-data-and-hadoop-training
Apache Spark on Apache HBase: Current and Future (HBaseCon)
- The document discusses Spark HBase Connector which combines Spark and HBase for fast access to key-value data. It allows running Spark and SQL queries directly on top of HBase tables.
- It provides high performance through data locality, partition pruning, and column pruning to reduce network overhead. Operations include bulk load, bulk put, bulk delete, and language integrated queries.
- The connector achieves improvements through a Spark Catalyst engine for query planning and optimization, and implementing HBase as an external data source with built-in filtering capabilities.
HBase is a distributed, column-oriented database that stores data in tables divided into rows and columns. It is optimized for random, real-time read/write access to big data. The document discusses HBase's key concepts like tables, regions, and column families. It also covers performance tuning aspects like cluster configuration, compaction strategies, and intelligent key design to spread load evenly. Different use cases are suitable for HBase depending on access patterns, such as time series data, messages, or serving random lookups and short scans from large datasets. Proper data modeling and tuning are necessary to maximize HBase's performance.
The document provides an introduction to Hadoop and big data concepts. It discusses key topics like what big data is characterized by the three V's of volume, velocity and variety. It then defines Hadoop as a framework for distributed storage and processing of large datasets using commodity hardware. The rest of the document outlines the main components of the Hadoop ecosystem including HDFS, YARN, MapReduce, Hive, Pig, Zookeeper, Flume and Sqoop and provides brief descriptions of each.
An overview of the current big data technology scope, prepared for the V.I.Tech and Wellcentive companies. It answers why we chose these products and what we really do with them, at a very high level.
This is the introductory presentation on HBase given by Hayden Marchant in the monthly Amobee Tech Talk.
In this session, we'll learn about HBase, a NoSQL database that provides real-time, random read and write access to tables meant to store billions of rows and millions of columns.
HBase is an open-source, non-relational, distributed, column-oriented database that is linearly scalable and designed to run on commodity hardware. HBase clusters can span hundreds or thousands of nodes, serving extraordinary amounts of information. Tight integration with Hadoop allows powerful analytical processing on data residing in HBase.
Apache Hadoop, HDFS and MapReduce Overview (Nisanth Simon)
This document provides an overview of Apache Hadoop, HDFS, and MapReduce. It describes how Hadoop uses a distributed file system (HDFS) to store large amounts of data across commodity hardware. It also explains how MapReduce allows distributed processing of that data by allocating map and reduce tasks across nodes. Key components discussed include the HDFS architecture with NameNodes and DataNodes, data replication for fault tolerance, and how the MapReduce engine works with a JobTracker and TaskTrackers to parallelize jobs.
Apache Hive is an open source data warehousing framework built on Hadoop. It allows users to query large datasets using SQL and handles parallelization behind the scenes. Hive supports various file formats like ORC, Parquet, and Avro. It uses a directed acyclic graph (DAG) execution engine like Tez or Spark to improve performance over traditional MapReduce. The metastore stores metadata about databases, tables, and partitions to allow data discovery and abstraction. Hive's cost-based optimizer and in-memory query processing features like LLAP improve performance for interactive queries on large datasets.
Impala is a massively parallel processing SQL query engine for Hadoop. It allows users to issue SQL queries directly to their data in Apache Hadoop. Impala uses a distributed architecture where queries are executed in parallel across nodes by Impala daemons. It uses a new execution engine written in C++ with runtime code generation for high performance. Impala also supports commonly used Hadoop file formats and can query data stored in HDFS and HBase.
Impala is a massively parallel processing SQL query engine for Apache Hadoop. It allows real-time queries on large datasets by using a new execution engine written in C++ instead of Java and MapReduce. Impala can process queries in milliseconds to hours by distributing query execution across Hadoop clusters. It uses existing Hadoop file formats and metadata but is optimized for performance through techniques like runtime code generation and in-memory processing.
The document provides an overview of Apache Hadoop and related big data technologies. It discusses Hadoop components like HDFS for storage, MapReduce for processing, and HBase for columnar storage. It also covers related projects like Hive for SQL queries, ZooKeeper for coordination, and Hortonworks and Cloudera distributions.
This document summarizes Tagomori Satoshi's presentation on handling "not so big data" at the YAPC::Asia 2014 conference. It discusses different types of data processing frameworks for various data sizes, from sub-gigabytes up to petabytes. It provides overviews of MapReduce, Spark, Tez, and stream processing frameworks. It also discusses what Hadoop is and how the Hadoop ecosystem has evolved to include these additional frameworks.
Apache Hadoop is an open-source software framework that supports distributed applications and processing of large data sets across clusters of commodity hardware. It is highly scalable, fault-tolerant and allows processing of data in parallel. Hadoop consists of Hadoop Common, HDFS for storage, YARN for resource management and MapReduce for distributed processing. HDFS stores large files across clusters and provides high throughput access to application data. MapReduce allows distributed processing of large datasets across clusters using a simple programming model.
This document provides an overview of big data processing techniques including batch processing using MapReduce and Hive, iterative batch processing using Spark, stream processing using Apache Storm, and OLAP over big data using Dremel and Druid. It discusses techniques such as MapReduce, Hive, Spark RDDs, and Storm tuples for processing large datasets and compares small versus big data approaches. Example usages and technologies for different processing types are also outlined.
Harnessing Hadoop: Understanding the Big Data Processing Options for Optimizi... (Cognizant)
This document discusses big data processing options for optimizing analytical workloads using Hadoop. It provides an overview of Hadoop and its core components HDFS and MapReduce. It also discusses the Hadoop ecosystem including tools like Pig, Hive, HBase, and ecosystem projects. The document compares building Hadoop clusters to using appliances or Hadoop-as-a-Service offerings. It also briefly mentions some Hadoop competitors for real-time processing use cases.
Hadoop is a framework for distributed storage and processing of large datasets across clusters of computers. It addresses problems like hardware failure and combining data after analysis. The core components are HDFS for distributed storage and MapReduce for distributed processing. HDFS stores data as blocks across nodes and handles replication for reliability. The Namenode manages the file system namespace and metadata, while Datanodes store and retrieve blocks. Hadoop supports reliable analysis of large datasets in a distributed manner through its scalable architecture.
The document discusses Hadoop, its components, and how they work together. It covers HDFS, which stores and manages large files across commodity servers; MapReduce, which processes large datasets in parallel; and other tools like Pig and Hive that provide interfaces for Hadoop. Key points are that Hadoop is designed for large datasets and hardware failures, HDFS replicates data for reliability, and MapReduce moves computation instead of data for efficiency.
Business intelligence analyzes data to provide actionable information for decision making. Big data, a market projected to reach $50 billion by 2017, refers to technologies that capture, store, manage, and analyze large, variable data collections. Hadoop is an open source framework for distributed storage and processing of large data sets on commodity hardware, enabling businesses to gain insight from massive amounts of structured and unstructured data. It involves components like HDFS for data storage, MapReduce for processing, and others for accessing, storing, integrating, and managing data.
The document provides an introduction to Hadoop and distributed computing, describing Hadoop's core components like MapReduce, HDFS, HBase and Hive. It explains how Hadoop uses a map-reduce programming model to process large datasets in a distributed manner across commodity hardware, and how its distributed file system HDFS stores and manages large amounts of data reliably. Functional programming concepts like immutability and avoiding state changes are important to Hadoop's ability to process data in parallel across clusters.
A comprehensive introduction to NoSQL solutions inside the big data landscape. Graph store? Column store? Key-value store? Document store? Redis or Memcached? DynamoDB? MongoDB? HBase? Cloud or open source?
Hadoop is an open-source framework for distributed storage and processing of large datasets across clusters of commodity hardware. It addresses problems posed by large and complex datasets that cannot be processed by traditional systems. Hadoop uses HDFS for storage and MapReduce for distributed processing of data in parallel. Hadoop clusters can scale to thousands of nodes and petabytes of data, providing low-cost and fault-tolerant solutions for big data problems faced by internet companies and other large organizations.
This document discusses Hadoop and its core components HDFS and MapReduce. It provides an overview of how Hadoop addresses the challenges of big data by allowing distributed processing of large datasets across clusters of computers. Key points include: Hadoop uses HDFS for distributed storage and MapReduce for distributed processing; HDFS works on a master-slave model with a Namenode and Datanodes; MapReduce utilizes a map and reduce programming model to parallelize tasks. Fault tolerance is built into Hadoop to prevent single points of failure.
Similar to "HBase, dances on the elephant back":
Public presentation skills are critical for any software engineer's career, but there are some specifics that must be taken into account. This presentation is based on my practical experience as a solutions architect and a speaker at technology conferences in Ukraine. Its main goal is to help engineers understand and avoid the most frequent pitfalls.
You have been running your SQL-centric infrastructure for 10 years and slowly start to notice that you cannot go on this way any longer: everything is getting too expensive, yet your business requires things that are simply impossible without radical changes.
This is exactly the situation we faced two years ago, so we would like to share our experience:
- Why and how did we get into Big Data?
- Why did we choose Apache and Hadoop?
- What to do and what is already done?
- What lessons were learned?
- Hadoop and relational databases: fight or synergy?
- Reactive Big Data manifest.
This document discusses frameworks in the context of big data solutions. It makes several key points:
1. Hadoop provides a stable core infrastructure for building big data solutions, with layers for resource management, distributed processing, file system, and coordination.
2. When going beyond the Hadoop core, frameworks should be selected that have a stable approach, flexible functionality, and an active community to contribute to existing solutions rather than creating new ones.
3. Performance overhead from frameworks is directly paid for with additional computing resources in large clusters, so frameworks should be chosen carefully based on their overhead. A newly created framework limits future flexibility more and more as it gains users.
Good software quality matters. How about Big Data and Hadoop? Every day we process large amounts of medical information at population scale. Any questions about why we pay attention to quality? Let's discuss some points based on our practical cases.
- Quality assurance in Hadoop: fears and reality.
- Practical options for testing infrastructure.
- Test strategy specials.
- Lowering verification efforts.
- Growth driven development: verification at scale.
- I want better elephant: current gaps.
Big Data technologies promise a lot of advantages from both the business and the engineering points of view, but nothing comes for free. Two years on this path led to many pitfalls discovered in terms of infrastructure approaches, solution complexity balance, and technology stack selection. Now we intend to warn you and have some fun.
More than a year of extremely intensive Big Data development, with Hadoop, HBase, MapReduce, and ZooKeeper as the key technologies. A new company with an established infrastructure that grows quickly. Lots of experience in networking and distributed systems, but a completely new world of enterprise solutions. What tasks does this bring? What issues and traps? What lessons were learned, and what are the near-future tasks? How can an embedded developer enter this new world, and what advantages does he or she have? What challenges should you be ready to face?
3. Is Hadoop good for data? ...so attractive
● Hadoop is an open source framework for big data, covering both distributed storage and processing.
● Hadoop is reliable and fault tolerant, without relying on hardware for these properties.
● Hadoop has unique horizontal scalability: currently from a single computer up to thousands of cluster nodes.
4. Hadoop: classical picture (historical top view)
● HDFS serves as the file system layer.
● MapReduce originally served as the distributed processing framework.
● The native client API is Java, but there are a lot of alternatives.
● But where is the SQL server here?
5. HBase motivation. So Hadoop is...
● Designed for throughput, not for latency.
● HDFS blocks are expected to be large; there is an issue with lots of small files.
● Write once, read many times ideology.
● MapReduce is not very flexible, and neither is any database built on top of it.
● How about realtime?
7. So HBase is for this.
● An open source Google BigTable implementation with its own proper place in the infrastructure.
● Realtime, low latency, linear scalability.
● Distributed, reliable and fault tolerant.
● Natural integration with other Hadoop components.
● No SQL and no secondary indexes out of the box.
● Limited ACID guarantees.
● Really good for massive scans.
8. Google Bigtable / Hadoop architecture and HBase
Layered stack, top to bottom: high layer applications; MapReduce (Hadoop MapReduce); YARN (resource management); distributed file system (Google FS, HDFS).
9. HBase facts and trends (2006 ... 2014 ... future)
● 2006: the Google BigTable paper is published; HBase development starts.
● 2007: first code is released as part of Hadoop 0.15; the focus is on offline crawl data storage.
● 2008: HBase goes OLTP (online transaction processing); 0.20 is the first performance release.
● 2010: HBase becomes an Apache top-level project.
● November 2010: Facebook elects HBase to implement its new messaging platform.
● HBase 0.92 is considered the production ready release.
10. HBase data paths on conceptual level
Two data paths sit on top of YARN (resource management) and the distributed file system (Google FS, HDFS): analytics and long running jobs go through adapters such as Hive and the MapReduce API into Hadoop MapReduce, while realtime operations go through the HBase API or adapters such as Impala.
● HBase can be used both for long running analytics and for realtime low latency operations.
● Third party adapters are possible if you need a fast track; some functionality and performance drawbacks are the price you pay.
11. Loose data structure
Book: title, author, pages, price. Ball: color, size, material, price. Toy car: color, type, radio control, price.
In one wide table with columns Kind, Price, Title, Author, Pages, Color, Size, Material, Type and Radio control, a Book row fills Price, Title, Author and Pages; a Ball row fills Price, Color, Size and Material; a Toy car row fills Price, Color, Type and Radio control; the remaining columns simply stay empty.
● Data looks like tables with a large number of columns.
● The column set can vary from row to row.
● No table modification is needed to add a column to a row.
Row examples: Book #1: Kind, Price, Title, Author, Pages. Ball #1: Kind, Price, Color, Size, Material. Toy car #1: Price, Color, Type + Radio control. Book #2: Kind, Price, Title, Author.
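To make the flexibility concrete, here is a minimal sketch of writing such rows through the Java client. It assumes an HBase 1.x-style API and a hypothetical items table with a single main column family; the column qualifiers differ per row without any schema change, and the string values are purely illustrative.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LooseSchemaExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("items"))) {

            // A book row carries book-specific columns.
            Put book = new Put(Bytes.toBytes("book#1"));
            book.addColumn(Bytes.toBytes("main"), Bytes.toBytes("kind"), Bytes.toBytes("book"));
            book.addColumn(Bytes.toBytes("main"), Bytes.toBytes("price"), Bytes.toBytes("12.50"));
            book.addColumn(Bytes.toBytes("main"), Bytes.toBytes("title"), Bytes.toBytes("Some Title"));
            book.addColumn(Bytes.toBytes("main"), Bytes.toBytes("pages"), Bytes.toBytes("360"));

            // A toy car row carries a completely different column set;
            // no table alteration is required for the extra qualifiers.
            Put toyCar = new Put(Bytes.toBytes("toy#1"));
            toyCar.addColumn(Bytes.toBytes("main"), Bytes.toBytes("price"), Bytes.toBytes("30.00"));
            toyCar.addColumn(Bytes.toBytes("main"), Bytes.toBytes("color"), Bytes.toBytes("red"));
            toyCar.addColumn(Bytes.toBytes("main"), Bytes.toBytes("radio_control"), Bytes.toBytes("true"));

            table.put(book);
            table.put(toyCar);
        }
    }
}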
12. Logical data structure
● Data is placed in tables.
● Every table row is identified by a unique row key.
● Every row consists of columns.
● Columns are grouped into families.
● Tables are split into regions based on row key ranges.
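As an illustration of families and regions, the sketch below creates a table with two column families and pre-splits it into three regions by row key range. The table and family names are hypothetical, and the admin calls shown are the HBase 1.x style.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CreateItemsTable {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {

            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("items"));
            desc.addFamily(new HColumnDescriptor("main"));   // frequently read columns
            desc.addFamily(new HColumnDescriptor("extra"));  // rarely read columns

            // Pre-split into three regions by row key range:
            // (-inf, "g"), ["g", "p"), ["p", +inf).
            byte[][] splitKeys = { Bytes.toBytes("g"), Bytes.toBytes("p") };
            admin.createTable(desc, splitKeys);
        }
    }
}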
13. Data storage structure
● Data is stored in HFiles.
● Families are stored on disk in separate files; each HFile holds entries of the form (row key, column, value, timestamp).
● Row keys are indexed in memory.
● A column includes key, qualifier, value and timestamp.
● There is no column limit.
● Storage is block based (default block size 64K).
● Delete is just another marker record.
● Periodic compaction is required.
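Because every stored cell carries a timestamp, several versions of the same column can be kept and read back. A minimal sketch, assuming the hypothetical items table and main family from the earlier examples and an HBase 1.x-style client:

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVersionsExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("items"))) {

            // Ask for up to three stored versions of every requested cell.
            Get get = new Get(Bytes.toBytes("book#1"));
            get.setMaxVersions(3);

            Result result = table.get(get);
            // Each returned cell is (row key, family, qualifier, timestamp, value).
            for (Cell cell : result.getColumnCells(Bytes.toBytes("main"), Bytes.toBytes("price"))) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}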
14. Architecture
● Zookeeper coordinates the distributed elements and is the primary contact point for clients.
● The Master server keeps metadata and manages data distribution over the region servers.
● Region servers manage the table regions, but the actual data storage service, including replication, is provided by HDFS data nodes. Clients communicate directly with region servers for data.
(Diagram: META flows between the client and Zookeeper, HMaster and the NameNode, while DATA flows directly between the client and the region servers (RS); on each rack the region servers are co-located with HDFS data nodes (DN).)
15. CRUD: Put and Delete
● Writes are logged and cached in memory.
● The main thing to remember: the lower layer is a write-once, append-only filesystem (HDFS). So the PUT and DELETE paths are identical.
● Both PUT and DELETE requests are per row key; there is no row key range for DELETE.
● DELETE is just another marker added.
● The actual delete is performed during compactions.
● Don't forget we can have several families.
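A minimal sketch of both operations through the Java client, again assuming the hypothetical items table, main family and an HBase 1.x-style API; note that the Delete calls only add tombstone markers, exactly as described above.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDeleteExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("items"))) {

            // Put: logged to the WAL, cached in the MemStore, flushed to an HFile later.
            Put put = new Put(Bytes.toBytes("ball#1"));
            put.addColumn(Bytes.toBytes("main"), Bytes.toBytes("color"), Bytes.toBytes("blue"));
            table.put(put);

            // Delete a single column of a single row: only a marker is written;
            // the data physically disappears during compaction.
            Delete deleteColumn = new Delete(Bytes.toBytes("ball#1"));
            deleteColumn.addColumn(Bytes.toBytes("main"), Bytes.toBytes("color"));
            table.delete(deleteColumn);

            // Delete a whole row: still addressed by row key, there is no range delete.
            table.delete(new Delete(Bytes.toBytes("ball#1")));
        }
    }
}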
16. CRUD: Put and Delete, write path
● The actual write goes to a region server; the master is not involved.
● All requests first go to the WAL (write ahead log) to provide recovery.
● The region server keeps a MemStore as temporary storage.
● Only when needed is the write flushed to disk (into an HFile).
17. CRUD: Get and Scan
● A Get operation is a simple data request by row key.
● A Scan operation is performed over a row key range, which can involve several table regions.
● Both Get and Scan can include client filters: expressions that are processed on the server side and can seriously limit the results, and therefore the traffic.
● Both Scan and Get operations can be performed over several column families.
● The Get operation is implemented through Scan.
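A hedged sketch of both read paths with a server-side filter, assuming the same hypothetical items table and an HBase 1.x-style API; SingleColumnValueFilter is one of the stock filters shipped with HBase.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class GetScanExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("items"))) {

            // Get: single row lookup by row key.
            Result row = table.get(new Get(Bytes.toBytes("book#1")));
            System.out.println(Bytes.toString(
                    row.getValue(Bytes.toBytes("main"), Bytes.toBytes("title"))));

            // Scan: a row key range plus a server-side filter that drops
            // non-matching rows before they ever reach the client.
            Scan scan = new Scan(Bytes.toBytes("book#"), Bytes.toBytes("book#~"));
            scan.setFilter(new SingleColumnValueFilter(
                    Bytes.toBytes("main"), Bytes.toBytes("kind"),
                    CompareOp.EQUAL, Bytes.toBytes("book")));

            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}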
18. Integration with MapReduce
● HBase provides a number of classes for native MapReduce integration; the main point is data locality.
● TableInputFormat allows massive MapReduce table processing (it maps the table with one region per mapper).
● HBase classes like Result (the Get / Scan result) or Put (the Put request) can be passed between MapReduce job stages.
● We have moderate experience of making things here even better.
(Diagram: JobTracker and NameNode on the master side; TaskTracker, DataNode and RegionServer often share a single node, so data is local.)
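A minimal sketch of how the integration is typically wired: TableMapReduceUtil configures TableInputFormat so that one map task scans each region of the (hypothetical) items table. It is written against an HBase 1.x-style API; the job only counts rows through a counter, so no reducer or output data is needed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ItemsRowCount {

    // One map task per table region; each mapper scans the region it is local to.
    static class CountMapper extends TableMapper<ImmutableBytesWritable, Result> {
        @Override
        protected void map(ImmutableBytesWritable key, Result value, Context context) {
            context.getCounter("items", "rows").increment(1);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "items row count");
        job.setJarByClass(ItemsRowCount.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger batches per RPC
        scan.setCacheBlocks(false);  // do not pollute the block cache from MR scans

        // Wires TableInputFormat under the hood: one input split per region.
        TableMapReduceUtil.initTableMapperJob(
                "items", scan, CountMapper.class,
                ImmutableBytesWritable.class, Result.class, job);

        job.setNumReduceTasks(0);
        job.setOutputFormatClass(NullOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}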
19. Coprocessors: key points
● Coprocessors are a feature that allows extending HBase without modifying the product code.
● A RegionObserver can attach code to operations on the region level.
● Similar functionality exists for HMaster.
● Endpoints are the way to provide functionality equivalent to stored procedures.
● Together, the coprocessor infrastructure can serve as a realtime distributed processing framework (a lightweight MapReduce).
22. Secondary indexes
● HBase has no support for secondary indexes out of the box.
● A coprocessor (RegionObserver) is used to track Put and Delete operations and update an index table.
● Scan operations with an index column filter are intercepted and processed based on the index table content.
(Diagram: the client sends Put / Delete requests to the table region, where the region observer issues index updates to the index table; a Scan with a filter is answered through an index search against the index table.)
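A hedged sketch of the write side of such an index, assuming an HBase 1.x-era coprocessor API (BaseRegionObserver; the observer interfaces changed in HBase 2.x) and hypothetical items / items_idx tables. After every Put on main:kind it writes a reference row into the index table.

import java.io.IOException;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

// Maintains a hypothetical "items_idx" table keyed by the indexed column value,
// so that "find rows where main:kind = X" becomes a plain index lookup.
public class KindIndexObserver extends BaseRegionObserver {

    private static final byte[] FAMILY = Bytes.toBytes("main");
    private static final byte[] QUALIFIER = Bytes.toBytes("kind");

    @Override
    public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                        Put put, WALEdit edit, Durability durability) throws IOException {
        for (Cell cell : put.get(FAMILY, QUALIFIER)) {
            byte[] value = CellUtil.cloneValue(cell);
            // Index row key = indexed value + separator + original row key.
            byte[] indexKey = Bytes.add(value, Bytes.toBytes("#"), put.getRow());
            Put indexPut = new Put(indexKey);
            indexPut.addColumn(FAMILY, Bytes.toBytes("ref"), put.getRow());
            try (Table indexTable =
                     ctx.getEnvironment().getTable(TableName.valueOf("items_idx"))) {
                indexTable.put(indexPut);
            }
        }
    }
}

The class would be packaged into a jar and attached to the table (for example through the table descriptor or the coprocessor settings in hbase-site.xml); the read side, which rewrites filtered Scans into index lookups, is left out of this sketch.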
23. Bulk load
● There is a way to load data into a table MUCH FASTER.
● HFiles are generated with the required data.
● It is preferable to generate one HFile per table region; MapReduce can be used for this.
● The prepared HFiles are merged into the table storage at maximum speed.
(Diagram: data importers feed the mappers; the reducers act as HFile generators, producing one HFile per table region.)
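A hedged sketch of the usual two-step recipe with the HBase 1.x MapReduce helpers: HFileOutputFormat2.configureIncrementalLoad sets up the reducers so that one HFile is written per region of the (hypothetical) items table, and LoadIncrementalHFiles then hands the finished files over to the region servers. The CSV input path and line format are assumptions for the example.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ItemsBulkLoad {

    // Turns hypothetical CSV lines "rowkey,kind,price" into Puts.
    static class CsvToPutMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] parts = line.toString().split(",");
            Put put = new Put(Bytes.toBytes(parts[0]));
            put.addColumn(Bytes.toBytes("main"), Bytes.toBytes("kind"), Bytes.toBytes(parts[1]));
            put.addColumn(Bytes.toBytes("main"), Bytes.toBytes("price"), Bytes.toBytes(parts[2]));
            context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        TableName tableName = TableName.valueOf("items");
        Path hfileDir = new Path("/tmp/items_hfiles"); // hypothetical staging directory

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(tableName);
             RegionLocator locator = conn.getRegionLocator(tableName);
             Admin admin = conn.getAdmin()) {

            // Step 1: generate one HFile per table region with MapReduce.
            Job job = Job.getInstance(conf, "items bulk load");
            job.setJarByClass(ItemsBulkLoad.class);
            job.setMapperClass(CsvToPutMapper.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);
            FileInputFormat.addInputPath(job, new Path("/tmp/items_csv")); // hypothetical input
            FileOutputFormat.setOutputPath(job, hfileDir);
            HFileOutputFormat2.configureIncrementalLoad(job, table, locator);
            if (!job.waitForCompletion(true)) {
                return;
            }

            // Step 2: move the ready HFiles into the table storage,
            // bypassing the WAL and MemStore entirely.
            new LoadIncrementalHFiles(conf).doBulkLoad(hfileDir, admin, table, locator);
        }
    }
}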
24. Replication and search integration
● The user just puts (or deletes) data: the client sends data updates to the HBase cluster (WAL, regions).
● Replication can be set up down to the column family level.
● Replication feeds the Lily HBase NRT indexer, which translates data changes into SOLR index updates.
● SOLR cloud finally provides search: search requests arrive over HTTP and search responses go back to the client.
● Apache Zookeeper does all the coordination.
● HDFS serves as the low level file system.
25. HUG benefits for members
USER GROUP MEMBERSHIP
● Just enter ‘ug367’ in the Promotional Code box when you check out at manning.com.
● To get this discount, please shop on www.oreilly.com and quote reference DSUG.