HBase is a distributed, non-relational database modeled after Google's BigTable. It provides a key-value data store that sits atop HDFS. This tutorial introduces HBase and the rhbase R package for interacting with HBase. It discusses installing HBase and rhbase, loading sample vehicle sensor data into HBase, and designing tables around query patterns to optimize retrieval of related data in one call.
Intro to HBase via R
Aaron Benz
04/28/2015
Part I: Intro to HBase
Welcome to a brief introduction to HBase by way of R. This tutorial aims to explain how you can use R
through the rhbase package. It will use my custom addition to rhbase which is geared toward using tidyr
principles and data.tables/data.frames. Other differences include:
1. Standardized row-key serializer (raw data type)
2. sz and usz functions only apply to the actual data instead of to row-key
3. hb.put - wrapper around hb.insert for easier inputting
4. hb.pull - wrapper around hb.scan for easier retrieval
The tutorial has three parts:
1. Installing HBase, Thrift, and rhbase, with a brief intro to HBase
2. Inserting data into HBase, and basic design/modeling
3. Retrieving data from HBase, doing calculations, and inserting calculations
What is HBase?
By no means will I attempt to explain all of HBase, but here is a brief attempt to summarize:
Wikipedia definition: HBase is an open source, non-relational, distributed database modeled after Google’s
BigTable and written in Java. It is developed as part of Apache Software Foundation’s Apache Hadoop
project and runs on top of HDFS (Hadoop Distributed File System), providing BigTable-like capabilities for
Hadoop.
From the Apache HBase website: “Use Apache HBase when you need random, realtime read/write
access to your Big Data. This project’s goal is the hosting of very large tables – billions of rows X millions
of columns – atop clusters of commodity hardware. Apache HBase is an open-source, distributed,
versioned, non-relational database modeled after Google’s BigTable. . . Just as BigTable leverages the
distributed data storage provided by the Google File System, Apache HBase provides BigTable-like capabilities
on top of Hadoop and HDFS.”
Clearly with those definitions you should know what HBase is now, right? If you’re like me, a little more
explanation is helpful. To start wrapping your head around NoSQL and HBase, here is a 5-step progression
that I have found helpful:
1. A Key-Value Store
• At its core, that’s all HBase really is. It’s a map, a dictionary (Python), a hash (Ruby), etc.
2. Distributed
• HBase sits atop a Hadoop cluster, meaning from HBase’s perspective that the data in HBase is
replicated out to, by default, 3 nodes in a Hadoop Cluster.
• Based on the assumption that servers go down and bad things happen
• A full implementation relies on HDFS, Zookeeper, and your HBase region master(s)
3. Consistent
• HBase is an immediately consistent database. That means that HBase guarantees your data to be
exactly the same on all of the nodes that it is replicated to.
• See CAP Theorem for more information on Consistent, Available, and Partition tolerant databases
(Cassandra is a similar NoSQL Columnar Database emphasizing Availability over Consistency)
4. Sorted
• The keys in this key-value store are stored in sorted order (lexicographically, by byte value)
• Designed for faster reads than writes
• Allows for fast “scanning” through keys
5. Multi-Dimensional Key-Value Store
• It is a key-value store, but perhaps a better way to think about HBase’s structure is a key-column-value store, or a Map of Maps (a small R sketch follows this list)
• This is what is referred to as “columnar” or “wide rows”
• For any given key, HBase allows you to store any amount of information
• Schema-less - your key-column-value combination can be defined at any time and does not naturally
conform to any schema
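To make the Map of Maps idea concrete, here is a tiny, purely illustrative R structure (nested named lists; this is only an analogy and has nothing to do with rhbase itself) showing one row-key holding several column/value pairs:
row <- list(
  "JFK::20140306::1CKPH7747ZK277944" = list(  # the row-key
    "1394083200" = 16.9,                      # column name (here a timestamp) -> value
    "1394083201" = 17.3,
    "1394083202" = 17.6
  )
)
row[["JFK::20140306::1CKPH7747ZK277944"]][["1394083201"]]  # key -> column -> value lookup
## [1] 17.3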
HBase Frame of Mind
• First there is an HBase table, which is exactly what you would think it is: a table
• Within a table are column families, which are a subgroup of your table. Best practice is to limit the
number and size of these. So, if you are new to HBase, just specify one column family, which is all that
is necessary in many cases.
• All data is then accessed via a rowkey, which is essentially your indexing mechanism (row-key or range
of row-keys points directly to the correct data). This is also your key in “key-column-value” as depicted
earlier
• Within a given row, there can be potentially millions of columns. This is the concept of wide rows.
Although many types of data lend themselves to columnar storage, time-series data in particular is a
good use case:
– Timestamps stored as column names
– Variable value correlating to a particular timestamp in a cell
– Variable name in the row key
– This concept is often hard to grasp the first time, so I have provided some visuals to help explain it. The idea usually clicks when people realize that values can be stored as column names.
• Schemaless. You do not need to add columns in advance; you can do it on the fly. However, you should keep a record of, or develop a scheme for, how you are storing data, as retrieval will be very difficult if you have no idea what is in there.
• Data modeling: based on query patterns, with data stored directly in its query-ready form. Cross-table joins are a BAD thing (Spark can help with this, but that does not mean you should design your data model to require joins). Essentially you are sacrificing complex querying for huge speed gains.
Hopefully that helped, and if not, there is plenty of information out there about HBase and what it does.
Here are a few links:
• Apache HBase
• Wikipedia
• Hortonworks
Installing HBase and rhbase
In order to use this stuff, you have to install HBase, Thrift, and the rhbase package. The basic instructions
are here, but if you are trying to get up and running as soon as possible, here are a few helpful hints (change version numbers as needed):
1. Install Thrift following this guide
2. Update PKG_CONFIG_PATH: export PKG_CONFIG_PATH=$PKG_CONFIG_PATH:/usr/local/lib/pkgconfig/
3. Verify the pkg-config path is correct: pkg-config --cflags thrift should return -I/usr/local/include/thrift
4. Copy the Thrift library: sudo cp /usr/local/lib/libthrift-0.8.0.so /usr/lib/
• Under this implementation I would advise using Thrift 0.8.0, as the latest version might include some bugs as it relates to this build
5. Install HBase following Apache’s quick start guide
6. Start up HBase and Thrift
[hbase-root]/bin/start-hbase.sh
[hbase-root]/bin/hbase thrift start
7. Now download and install the package with devtools (or get a tarball copy here)
install.packages("devtools")
devtools::install_github("aaronbenz/rhbase")
Test it out
Provided HBase, Thrift, and rhbase are now installed correctly and are running, the code below should run
successfully.
library(rhbase)
library(data.table)
hb.init()
## <pointer: 0x21efd20>
## attr(,"class")
## [1] "hb.client.connection"
Understanding Our Sample Data
The data that is supplied in this tutorial is simulated time-series data for airport support vehicles (such
as baggage trucks) from various airports over a small period of time. The data is stored hierarchically as:
Airport_Day_VehicleID_Variable. You can retrieve a list of all of the data simply by loading it from the
rhbase package:
data(baggage_trucks)
str(baggage_trucks[1:3])
## List of 3
##  $ JFK_20140306_1CKPH7747ZK277944_gear :Classes 'data.table' and 'data.frame': 126 obs. of 2 variables:
##   ..$ date_time: num [1:126] 1.39e+09 1.39e+09 1.39e+09 1.39e+09 1.39e+09 ...
##   ..$ gear     : num [1:126] 1 2 3 4 3 2 1 2 3 4 ...
##   ..- attr(*, ".internal.selfref")=<externalptr>
##  $ JFK_20140306_1CKPH7747ZK277944_rpm  :Classes 'data.table' and 'data.frame': 13770 obs. of 2 variables:
##   ..$ date_time: num [1:13770] 1.39e+09 1.39e+09 1.39e+09 1.39e+09 1.39e+09 ...
##   ..$ rpm      : num [1:13770] 1012 1229 1460 1758 1706 ...
##   ..- attr(*, ".internal.selfref")=<externalptr>
##  $ JFK_20140306_1CKPH7747ZK277944_speed:Classes 'data.table' and 'data.frame': 13793 obs. of 2 variables:
##   ..$ date_time: num [1:13793] 1.39e+09 1.39e+09 1.39e+09 1.39e+09 1.39e+09 ...
##   ..$ speed    : num [1:13793] 16.9 17.3 17.6 17.7 18.1 ...
##   ..- attr(*, ".internal.selfref")=<externalptr>
The three variables are gear, speed (mph), and rpm (revolutions per minute of engine). With these variables,
the goal is to calculate the fuel rate of a vehicle. For more information about the data use ?baggage_trucks.
Note: Credit to Spencer Herath for creating the sample data set of imaginary trucks whizzing around an
invisible airport.
Part II: Getting Data Into HBase with R
HBase Table Design
It’s important that the HBase table is designed to fit your query pattern(s) exactly. A NoSQL Columnar
frame of mind is always Design Your Tables For Your Query Pattern. Unlike a relational store, each
table that is built is traditionally designed for a single type of query pattern (document stores like Solr or Elasticsearch can offer a backwards-indexing alternative, which overall can make the data modeling experience “simpler” in concept). A good way to think about NoSQL is that in one fetch, you want to retrieve all of the data necessary to make a business decision. To recap slightly, this frame of mind implies that:
• Avoid joins completely if possible
• Do as few lookups as possible
• Do all data transformations on the incoming stream versus the outgoing
Remember, the goal is to get the data out and delivered as soon as possible, so take special precaution to
ensure that the data and the query patterns are designed for the end format, not the raw format.
Currently the data in its raw format is organized by variable, by vehicle, and by date. Because this is
archived data, additional compression to a binary blob (byte array) is achievable. If properly done, it will
drastically reduce the size of the data set and immensely increase the speed at which it is retrieved (because
the dataset sits in just one blob as opposed to many cells). However, the approach should be sensitive to
any memory limitations because the size and number of blobs have to be manageable. That is, data needs
to be retrieved in manageable partitions that contain all of the necessary variables to perform a fuel usage
calculation. With those concepts in mind, the query pattern should accommodate specific airports and date ranges, along with whatever variables are desired. Thus (a small key-handling sketch follows the list below):
• row-key = airport::day::vin
• column = variable
• value = specific data.table
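As a small illustrative sketch (the helpers below are hypothetical, not part of rhbase), composite keys in that scheme can be composed and split back apart with base R:
make_key  <- function(airport, day, vin) paste(airport, day, vin, sep = "::")
split_key <- function(key) strsplit(key, "::", fixed = TRUE)[[1]]
make_key("JFK", "20140306", "1CKPH7747ZK277944")
## [1] "JFK::20140306::1CKPH7747ZK277944"
split_key("JFK::20140306::1CKPH7747ZK277944")
## [1] "JFK"               "20140306"          "1CKPH7747ZK277944"
Keeping the airport first means rows for the same airport and day sort next to each other, which is what makes range scans over airport/date prefixes cheap.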
Create HBase Table
Now that the data model is defined, a table must be created to use it. In this case,
we create a table called Test with a column family called test:
hostLoc = '127.0.0.1' #Give your server IP
port = 9090 #Default port for Thrift service
hb.init()
## <pointer: 0x23f6fd0>
## attr(,"class")
## [1] "hb.client.connection"
hb.list.tables()
## $Test_Text
## maxversions compression inmemory bloomfiltertype bloomfiltervecsize
## test: 3 NONE FALSE NONE 0
## bloomfilternbhashes blockcache timetolive
## test: 0 FALSE 2147483647
TABLE_NAME = "Test"
COLUMN_FAMILY = "test"
hb.new.table(TABLE_NAME, COLUMN_FAMILY)
## [1] TRUE
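One practical note: if the table is left over from an earlier run, creating it again will typically fail. A hedged cleanup sketch, assuming the hb.delete.table function from the upstream rhbase package is available in this build:
if (TABLE_NAME %in% names(hb.list.tables())) hb.delete.table(TABLE_NAME)  # drop any stale copy
hb.new.table(TABLE_NAME, COLUMN_FAMILY)                                   # then recreate it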
Input Data into HBase
Now that the HBase table is created, just put the baggage_trucks data into HBase. This can be done with
the convenient hb.put function:
require(magrittr, quietly = T)
require(tidyr, quietly = T, warn.conflicts = F)
data(baggage_trucks)
dt_names <- names(baggage_trucks) %>%
data.frame() %>%
tidyr::separate(".",c("airport", "date", "vin", "variable"))
dt_names <- dt_names %>% unite(rowkey, airport, date, vin, sep = "::")
head(dt_names)
## rowkey variable
## 1 JFK::20140306::1CKPH7747ZK277944 gear
## 2 JFK::20140306::1CKPH7747ZK277944 rpm
## 3 JFK::20140306::1CKPH7747ZK277944 speed
## 4 JFK::20140306::1CWPJ7320VE852372 gear
## 5 JFK::20140306::1CWPJ7320VE852372 rpm
## 6 JFK::20140306::1CWPJ7320VE852372 speed
How hb.put Works
The design of hb.put is meant to be simple and flexible. For a given table and column family, hb.put allows the creation of a list of “columns” and a list of “values” for each “row-key.” This option is designed for inputting multiple columns into the same row-key (an uncompressed time-series use case). Additionally, hb.put allows the insertion of data using a 1-1-1 ratio, as in this example:
hb.put(table_name = TABLE_NAME,
       column_family = COLUMN_FAMILY,
       rowkey = dt_names$rowkey,
       column = dt_names$variable,
       value = baggage_trucks)
## [1] TRUE
And just like that, data is in HBase. Before proceeding further, it is important to understand how the data was put into HBase, as this is actually a modification of the original rhbase package. The row-keys are turned into byte arrays using the charToRaw method, converting the character row-key into a raw binary data type. The data.tables are turned into byte arrays using R’s native serializer. If you would like to use your own serializer for the actual data (because it is already serialized, for example), input the data as raw by specifying sz = "raw", "character", or a custom function in hb.put, or specify it in the original hb.init function. Note: the row-key serializer is not editable at the moment. The change in this fork was separating the serialization method for the row-keys from that of the values.
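For example, a hedged sketch of inserting a value that is already serialized (the row-key and column name here are made up; sz is the argument described above):
#Illustrative only: store a pre-serialized byte array without re-serializing it
pre_serialized <- serialize(baggage_trucks[[1]], NULL)
hb.put(table_name = TABLE_NAME,
       column_family = COLUMN_FAMILY,
       rowkey = "JFK::20140306::EXAMPLEVIN000000000",
       column = "gear_raw",
       value = list(pre_serialized), #already a raw vector
       sz = "raw")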
Examples of Retrieving Data
Now that data is inserted, here are some brief examples worth reviewing to understand how data can be retrieved from HBase with this package.
1. Retrieving only from 03/06/2014 onward for LAX, and only the speed variable:
hb.pull(TABLE_NAME,
        COLUMN_FAMILY,
        start = "LAX::20140306",
        end = "LAXa",
        columns = "speed",
        batchsize = 100)
## rowkey column_family column values
## 1: LAX::20140306::1CHMP60486H283129 test speed <data.table>
## 2: LAX::20140306::1FAJL35763D392894 test speed <data.table>
## 3: LAX::20140306::1FJAL1998VS238734 test speed <data.table>
## 4: LAX::20140306::1FSMZ91563C548451 test speed <data.table>
## 5: LAX::20140307::1CHMP60486H283129 test speed <data.table>
## 6: LAX::20140307::1FAJL35763D392894 test speed <data.table>
## 7: LAX::20140307::1FJAL1998VS238734 test speed <data.table>
## 8: LAX::20140307::1FSMZ91563C548451 test speed <data.table>
2. Retrieving everything between the start of 03/07/2014 and the start of 03/08/2014:
hb.pull(TABLE_NAME,
        COLUMN_FAMILY,
        start = "LAX::20140307",
        end = "LAX::20140308",
        batchsize = 100)
## rowkey column_family column values
## 1: LAX::20140307::1CHMP60486H283129 test gear <data.table>
## 2: LAX::20140307::1CHMP60486H283129 test rpm <data.table>
## 3: LAX::20140307::1CHMP60486H283129 test speed <data.table>
## 4: LAX::20140307::1FAJL35763D392894 test gear <data.table>
## 5: LAX::20140307::1FAJL35763D392894 test rpm <data.table>
## 6: LAX::20140307::1FAJL35763D392894 test speed <data.table>
## 7: LAX::20140307::1FJAL1998VS238734 test gear <data.table>
## 8: LAX::20140307::1FJAL1998VS238734 test rpm <data.table>
## 9: LAX::20140307::1FJAL1998VS238734 test speed <data.table>
## 10: LAX::20140307::1FSMZ91563C548451 test gear <data.table>
## 11: LAX::20140307::1FSMZ91563C548451 test rpm <data.table>
## 12: LAX::20140307::1FSMZ91563C548451 test speed <data.table>
Part III: Retrieve and Store
Okay, now that data is in HBase, let’s:
1. Retrieve data with rhbase
2. Manipulate data using tidyr, data.table, and timeseriesr
3. Perform calculations with timeseriesr
4. Store results using rhbase
Retrieving Data with rhbase
In the HBase input tutorial (Part II above), we stored data.tables as byte arrays in HBase. But what about getting them out? By using hb.pull, we will be able to pull our desired HBase data.
Going back to the use case at hand, our goal is to measure the total fuel consumption of the trucks present in
the data. To do so, we need to call gear, rpm, and speed (the three variables we put in HBase) and apply a
custom fuel calculator in R. However, as mentioned previously, we need to be careful about how we bring
data in, as too much data at one time could easily lead to memory problems.
Based on how we modeled our data, we are going to retrieve ALL variables for ALL VINs by EACH day for EACH airport. We can do this in HBase with the scan function. Recalling the description of HBase, all of the keys or row-keys are sorted lexicographically. This means that all of the data for one airport is stored contiguously. Additionally, because of our row-key structure, all of the data is sorted by airport, then by date. A scan operation will allow us to get back all of the data for one airport for one day in essentially one internal operation (iop). That is, with one scan we can get all of the data because it is located in one contiguous block of rows. So how can we do this in R?
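A quick illustration of the ordering with made-up keys: because row-keys sort lexicographically, every key sharing an airport::day prefix is contiguous, so one start/end range covers them all.
#Illustrative keys only
sort(c("LAX::20140306::VIN1", "JFK::20140307::VIN2",
       "JFK::20140306::VIN1", "JFK::20140306::VIN2"))
## [1] "JFK::20140306::VIN1" "JFK::20140306::VIN2" "JFK::20140307::VIN2"
## [4] "LAX::20140306::VIN1"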
We begin with a list of our airports:
airports <- c("JFK","LAX")
Then we create a list of all the dates we want, which in this case is between the 6th and 8th of March, 2014.
start_dates <- paste0("2014030", 6:7)
end_dates <- paste0("2014030", 7:8)
Now let’s create a function that will allow us to pull back all of the variables for one airport, one day at a time, to demonstrate responsible memory management.
library(data.table)
rk_generator <- function(start, end, ...){
  function(x){
    data.table(start = paste(x, start, ...), end = paste(x, end, ...))
  }
}
march <- rk_generator(start_dates, end_dates, sep = "::")
The function march (the closure returned by rk_generator) gives us all of the start/end row-keys we want for each airport. For example:
march(airports[1])
## start end
## 1: JFK::20140306 JFK::20140307
## 2: JFK::20140307 JFK::20140308
This output will feed our function to call HBase for each day for each airport.
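Putting it together, here is a hedged sketch of the loop this enables; the rest of this document walks through one iteration of its body in detail.
#Sketch: process one airport-day range at a time to keep memory bounded
for(airport in airports){
  ranges <- march(airport)
  for(i in seq_len(nrow(ranges))){
    chunk <- hb.pull(TABLE_NAME, COLUMN_FAMILY,
                     start = ranges$start[i],
                     end = ranges$end[i],
                     columns = c("gear", "rpm", "speed"))
    #...merge, calculate fuel usage, and hb.put the results (see below)...
  }
}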
Pull Data, Merge, and Calculate
1. Bring in some data for one day
2. Merge the data together to do a proper fuel calculation of gal_per_hr, average speed, and time in use
3. Visualize some of the results:
a_day <- march(airports[1])[1]
Okay, time to bring in some data:
library(rhbase)
library(magrittr)
hb.init()
## <pointer: 0x3987230>
## attr(,"class")
## [1] "hb.client.connection"
data <- hb.pull("Test",
                "test",
                start = a_day[[1]],
                end = a_day[[2]],
                columns = c("gear","rpm","speed"))
data[1:6]
## rowkey column_family column values
## 1: JFK::20140306::1CKPH7747ZK277944 test gear <data.table>
## 2: JFK::20140306::1CKPH7747ZK277944 test rpm <data.table>
## 3: JFK::20140306::1CKPH7747ZK277944 test speed <data.table>
## 4: JFK::20140306::1CWPJ7320VE852372 test gear <data.table>
## 5: JFK::20140306::1CWPJ7320VE852372 test rpm <data.table>
## 6: JFK::20140306::1CWPJ7320VE852372 test speed <data.table>
It’s that simple.
Manipulate Data with tidyr, data.table, and timeseriesr
Next let’s do something with this data. Our goal is to combine the gear, rpm, and speed data.tables by VIN. To do this, we will:
1. Split the row-key to make the values meaningful with tidyr
2. Combine the data.tables with VIN in mind
3. Clean up the merged data
1. Split with tidyr:
data <- data %>%
  tidyr::separate(col = "rowkey", into = c("airport","day","vin"))
data[1:5]
## airport day vin column_family column values
## 1: JFK 20140306 1CKPH7747ZK277944 test gear <data.table>
## 2: JFK 20140306 1CKPH7747ZK277944 test rpm <data.table>
## 3: JFK 20140306 1CKPH7747ZK277944 test speed <data.table>
## 4: JFK 20140306 1CWPJ7320VE852372 test gear <data.table>
## 5: JFK 20140306 1CWPJ7320VE852372 test rpm <data.table>
2. Combine the data.tables by VIN and column:
#merge the gear, rpm, and speed data.tables for each vin/day/airport by date_time
setkeyv(data, c("vin", "column"))
merge_em <- function(values){
  if(length(values) <= 1) return(values[[1]])
  out <- values[[1]]
  for(i in 2:length(values)){
    out <- merge(out, values[[i]], all = T, by = "date_time")
  }
  out
}
data2 <- data[, list("rbinded" = list(merge_em(values))),
              by = c("vin","day","airport")] #data.table functionality
data2$rbinded[[1]] %>% setDT
3. Clean up our data with timeseriesr. Because the timestamps for each variable were not guaranteed to match, we probably (and in fact do) have NA values in each data set. This use of dtreplace will take any NA values and replace them with the last observation (a fill-forward).
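As a quick toy illustration of that fill-forward behavior (plain base R, not the timeseriesr internals):
#Toy example: NA values inherit the most recent non-NA observation
x <- c(1, NA, NA, 4, NA)
for(j in which(is.na(x))) x[j] <- x[j - 1]
x
## [1] 1 1 1 4 4
The actual cleanup on the merged data.tables: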
data2$rbinded <- data2$rbinded %>%
  lapply(timeseriesr::dtreplace) %>%
  lapply(setDT) #fills in missing NAs
data2$rbinded[[1]]
## date_time gear rpm speed
## 1: 1394112055 1 1012.492 16.937217
## 2: 1394112055 1 1228.810 17.264578
## 3: 1394112055 1 1459.631 17.585472
## 4: 1394112056 1 1757.546 17.743490
## 5: 1394112056 1 1705.891 18.054664
## ---
## 14081: 1394120120 1 3762.915 19.516140
## 14082: 1394120120 1 2902.979 14.085581
## 14083: 1394120120 1 2602.196 11.170207
## 14084: 1394120121 1 1898.852 4.939276
## 14085: 1394120121 1 1150.547 4.939276
4. Now let’s see what this looks like:
[Figure: gear, rpm, and speed for one truck, each plotted against time from roughly 07:30 to 09:30.]
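The original plotting code is not shown; a hedged sketch of how a figure like this could be reproduced (ggplot2 is an assumption and is not used elsewhere in this document):
#Sketch only: facet gear, rpm, and speed against time for one truck
library(ggplot2)
plot_dt <- data.table::melt(data2$rbinded[[1]],
                            id.vars = "date_time",
                            measure.vars = c("gear", "rpm", "speed"))
plot_dt[, Time := as.POSIXct(date_time, origin = "1970-01-01")]
ggplot(plot_dt, aes(Time, value)) +
  geom_line() +
  facet_grid(variable ~ ., scales = "free_y")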
Perform Our Calculations with timeseriesr
Now that we have our data in memory, let’s do our calculation! Below is a function to calculate gal_per_hr using our three variables (rpm, gear, and speed):
# I estimated this by looking at some relations between acceleration,
# vehicle weight, torque, and all of the other things you see below on the magical world wide web
# calculates fuel usage by row, returns an array of fuel usage
# engine power = torque * rpm / 5252
# The constant 5252 is the rounded value of (33,000 ft·lbf/min)/(2π rad/rev).
# fuel usage: if power < 0, alpha
#           : alpha + min(Pmax, engine power)
# alpha = constant idle fuel consumption rate
#         (estimate of fuel used to maintain engine operation)
# Pt = power = min(Pmax, torque + inertia(mass * radius^2))
gal_per_hr = function(engine_gear,
                      engine_rpm,
                      time,
                      speed,
                      alpha = .7,
                      mass = 10000,
                      gear_ratio = 1.2,
                      Pmax = 200,
                      efficiency_parameter = .02,
                      acceleration_parameter = .01,
                      wheel_radius){
  torque <- c(diff(engine_rpm), 0)
  torque[torque < 0] <- 0
  torque <- torque * engine_gear * gear_ratio
  Pt <- torque * engine_rpm / (33000/2/pi)
  Pt[Pt > Pmax] <- Pmax #cap power at Pmax
  engine_power <- Pt
  acceleration <- c(-diff(speed), 0) / c(-diff(time), 1)
  #Pi = Mv(kg) * acceleration * velocity / 1000
  fuel <- alpha +
    efficiency_parameter * Pt +
    acceleration_parameter * acceleration * mass * .45359/1000 * speed
  fuel[fuel < alpha] <- alpha
  fuel[is.nan(fuel)] <- alpha
  return(fuel)
}
Now let’s actually perform that operation:
data2$rbinded <- lapply(data2$rbinded, function(i){
  #the result is currently in gallons per hour, but we want gallons per second
  #to match the time column (recorded in seconds)
  i$gal_per_sec <- gal_per_hr(engine_gear = i$gear,
                              engine_rpm = i$rpm,
                              speed = i$speed,
                              time = i$date_time)/60/60
  i %>% setDT
})
data2$rbinded[[1]]
## date_time gear rpm speed gal_per_sec
## 1: 1394112055 1 1012.492 16.937217 0.0008217549
## 2: 1394112055 1 1228.810 17.264578 0.0009034893
## 3: 1394112055 1 1459.631 17.585472 0.0010965335
## 4: 1394112056 1 1757.546 17.743490 0.0005422796
## 5: 1394112056 1 1705.891 18.054664 0.0014310010
## ---
## 14081: 1394120120 1 3762.915 19.516140 0.0001944444
## 14082: 1394120120 1 2902.979 14.085581 0.0001944444
## 14083: 1394120120 1 2602.196 11.170207 0.0001944444
## 14084: 1394120121 1 1898.852 4.939276 0.0001944444
## 14085: 1394120121 1 1150.547 4.939276 0.0001944444
Now we have done our main calculation. But we want to know the total amount of gallons each truck burned
per day. We are going to use calc_area, which is a timeseriesr function that calculates the area under the
curve. It uses a Riemann left sum approach, but other methods may be added in the future.
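As a quick sanity check of the left-sum idea (plain R, independent of calc_area), each value is held constant until the next timestamp:
#Toy example of a left Riemann sum
t <- c(0, 1, 3, 6)
v <- c(2, 2, 1, 5)
sum(head(v, -1) * diff(t)) #2*1 + 2*2 + 1*3
## [1] 9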
#calculates area under the curve as a total
data2$gallons <- data2$rbinded %>%
  sapply(function(x)
    timeseriesr::calc_area(time_date = x$date_time,
                           value = x$gal_per_sec))
#calculates the total time range in hours (data is recorded in seconds)
data2$hours <- data2$rbinded %>%
  sapply(function(x)
    (max(x$date_time) - min(x$date_time))/60/60)
#calculates the average speed
data2$mph <- data2$rbinded %>%
  sapply(function(x) mean(x$speed))
data2[, gal_per_hr := gallons/hours]
## vin day airport rbinded gallons hours
## 1: 1CKPH7747ZK277944 20140306 JFK <data.table> 3.1177370 2.240531
## 2: 1CWPJ7320VE852372 20140306 JFK <data.table> 0.9673207 1.033856
## 3: 1FEMY6958XU502984 20140306 JFK <data.table> 1.5951636 1.757431
## 4: 1FVTK43691G456340 20140306 JFK <data.table> 3.6247181 3.852247
## 5: 1TULD4346YD544661 20140306 JFK <data.table> 1.6412674 1.803878
## mph gal_per_hr
## 1: 9.026001 1.3915173
## 2: 10.746364 0.9356439
## 3: 14.793853 0.9076681
## 4: 13.923766 0.9409360
## 5: 9.326923 0.9098551
We have now calculated the total number of gallons each truck burned, the total hours it ran, its average miles per hour, and its average gallons per hour. Now, let’s put all of this back into HBase so we can move on to the next day/airport.
Store Our Results Using rhbase
We are going to store all of the information that we collected back into the same HBase table for later use. That includes:
1. the rbinded data.table (because we might want to reuse it later)
2. gallons
3. total hours in operation
4. gal_per_hr
5. average mph
To do this we need to slightly reorganize our table to fit the hb.put standards. Fortunately, the tidyr package allows us to do this with ease.
#First, create the rowkey with unite
data3 <- tidyr::unite(data2, "rowkey", airport:vin, sep = "::")
#Second, reorganize the data with gather
data3 <- tidyr::gather(data3, column, value, -rowkey, convert = T)
## Warning in melt.data.table(data, measure.vars = gather_cols, variable.name
## = key_col, : All 'measure.vars are NOT of the SAME type. By order of
## hierarchy, the molten data value column will be of type 'list'. Therefore
## all measure variables that are not of type 'list' will be coerced to.
## Check the DETAILS section of ?melt.data.table for more on coercion.
data3[c(1,6,11,16,21)]
## rowkey column value
## 1: JFK::20140306::1CKPH7747ZK277944 rbinded <data.table>
## 2: JFK::20140306::1CKPH7747ZK277944 gallons 3.117737
## 3: JFK::20140306::1CKPH7747ZK277944 hours 2.240531
## 4: JFK::20140306::1CKPH7747ZK277944 mph 9.026001
## 5: JFK::20140306::1CKPH7747ZK277944 gal_per_hr 1.391517
Great! Now that we have it in the format we want, let’s put it back in HBase:
#Assuming the hb.init connection is still valid
hb.put("Test",
"test",
rowkey = data3$rowkey,
column = data3$column,
value = data3$value)
## [1] TRUE
And just to test it out, let’s see what happens when we pull back one of the new columns we added:
hb.pull("Test","test",c("gal_per_hr","rbinded"))
## rowkey column_family column values
## 1: JFK::20140306::1CKPH7747ZK277944 test gal_per_hr 1.391517
## 2: JFK::20140306::1CKPH7747ZK277944 test rbinded <data.table>
## 3: JFK::20140306::1CWPJ7320VE852372 test gal_per_hr 0.9356439
## 4: JFK::20140306::1CWPJ7320VE852372 test rbinded <data.table>
## 5: JFK::20140306::1FEMY6958XU502984 test gal_per_hr 0.9076681
## 6: JFK::20140306::1FEMY6958XU502984 test rbinded <data.table>
## 7: JFK::20140306::1FVTK43691G456340 test gal_per_hr 0.940936
## 8: JFK::20140306::1FVTK43691G456340 test rbinded <data.table>
## 9: JFK::20140306::1TULD4346YD544661 test gal_per_hr 0.9098551
## 10: JFK::20140306::1TULD4346YD544661 test rbinded <data.table>
Pretty cool!
What is great about this is that you can easily parallelize this operation across multiple cores/nodes/clusters because we broke it down into intervals (airport and date). That is, our data model and query pattern were designed to make a fork-and-join operation easy. Check out the parallel package or even the rmr2 package for more details on how to do that.
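For instance, a hedged sketch with the parallel package (the core count and structure are illustrative):
#Sketch only: fork one task per airport-day range, then join the results
library(parallel)
ranges <- data.table::rbindlist(lapply(airports, march))
results <- mclapply(seq_len(nrow(ranges)), function(i){
  #pull, merge, calculate, and hb.put for one airport-day, as shown above
  ranges[i]
}, mc.cores = 2)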
If you have any suggestions, comments, or questions, please feel free to contact me. Also, if you would like to contribute further to the rhbase fork or the timeseriesr package, all contributions are welcome.