Embulk is a parallel bulk data loader that uses plugins to integrate data from various sources and destinations. One such plugin converts records into GeoJSON, so that each record carries geographic metadata. The presentation demonstrated using Embulk with this GeoJSON plugin to extract open data from a Japanese government site and visualize it on a map with D3.js, highlighting Embulk's ability to simplify complex data integration tasks.
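To make the GeoJSON step concrete, here is a minimal Python sketch of turning tabular records into a GeoJSON FeatureCollection. It is not the Embulk plugin's implementation; the function names and the `lat`/`lon` column names are assumptions made for illustration. The only fixed point is the GeoJSON convention that a Point's coordinates are ordered longitude first, then latitude.

```python
import json


def row_to_feature(row, lon_key="lon", lat_key="lat"):
    """Convert one record (a dict) into a GeoJSON Feature.

    lon_key and lat_key are hypothetical column names; real data sets
    will use their own.
    """
    # Everything that is not a coordinate becomes a property of the feature.
    properties = {k: v for k, v in row.items() if k not in (lon_key, lat_key)}
    return {
        "type": "Feature",
        "geometry": {
            "type": "Point",
            # GeoJSON orders coordinates as [longitude, latitude].
            "coordinates": [float(row[lon_key]), float(row[lat_key])],
        },
        "properties": properties,
    }


def rows_to_feature_collection(rows, lon_key="lon", lat_key="lat"):
    """Wrap many records into a single GeoJSON FeatureCollection."""
    return {
        "type": "FeatureCollection",
        "features": [row_to_feature(r, lon_key, lat_key) for r in rows],
    }


if __name__ == "__main__":
    # A made-up record, used only to show the output shape.
    sample = [{"name": "example facility", "lat": "35.0", "lon": "139.0"}]
    print(json.dumps(rows_to_feature_collection(sample), indent=2))
```

A FeatureCollection in this shape can be passed to D3.js geographic path generators, which is the kind of map visualization the presentation describes.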
Erasure coding in HDFS provides redundancy for data blocks while using less storage space than simple replication. It works by striping each file across multiple DataNodes as data blocks plus parity blocks computed from them. When a read encounters a missing or corrupted block, the block is reconstructed from the surviving data and parity blocks, so data remains reliable even if some blocks are lost. Striping also helps read performance, because blocks, including those needed for reconstruction, can be read in parallel from different nodes.
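The reconstruction idea can be sketched with the simplest possible erasure code: a single XOR parity block, which tolerates the loss of any one block. This is a deliberately simplified stand-in, not the Reed-Solomon codes HDFS typically uses, which tolerate several simultaneous losses. All function names are hypothetical, and the blocks are assumed to be of equal length.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Compute one parity block for a stripe of data blocks."""
    return xor_blocks(data_blocks)


def reconstruct(blocks, parity):
    """Rebuild at most one missing data block (marked as None)."""
    missing = [i for i, block in enumerate(blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("single XOR parity can recover only one lost block")
    restored = list(blocks)
    if missing:
        survivors = [block for block in blocks if block is not None]
        # XOR of all surviving blocks and the parity yields the lost block.
        restored[missing[0]] = xor_blocks(survivors + [parity])
    return restored


def storage_overhead(data_units, redundant_units):
    """Extra storage as a fraction of the original data size."""
    return redundant_units / data_units


if __name__ == "__main__":
    stripe = [b"abcd", b"efgh", b"ijkl"]
    parity = encode(stripe)
    print(reconstruct([b"abcd", None, b"ijkl"], parity))
    # Six data units plus three parity units versus two extra full copies.
    print(storage_overhead(6, 3), storage_overhead(1, 2))
```

The overhead helper shows where the storage saving comes from: a layout with six data units and three parity units adds 50% extra storage, whereas three-way replication adds 200%.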
Managing multi tenant resource toward Hive 2.0 (Kai Sasaki)
This document discusses Treasure Data's migration architecture for managing resources across multiple clusters when upgrading from Hive 1.x to Hive 2.0. It introduces components such as PerfectQueue and Plazma that enable blue-green deployment without downtime, and describes how automated testing and validation are used to prevent performance degradation. Resources are defined per account across different job queues and Hadoop clusters. Brief performance comparisons show improvements from Hive 2.x features such as Tez and vectorization.
Kai Sasaki discusses Treasure Data's architecture for maintaining Hadoop in the cloud. Key points include keeping services such as the Hive metastore stateless and relying on cloud storage. Multiple Hadoop versions are managed by downloading packages from S3. Regression tests on Hive queries and a REST API help ensure that changes do not cause issues. An RDBMS-based queue provides persistence and scheduling across tasks. The overall aim is high maintainability through statelessness, mobility of components, and queueing of jobs.
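As an illustration of what an RDBMS-based queue with persistence and scheduling can look like, here is a minimal sketch using Python's built-in `sqlite3` module. It is not Treasure Data's implementation; the table layout, column names, and function names are all assumptions. Tasks are rows, a `run_at` timestamp provides scheduling, and a conditional `UPDATE` hands each task to only one worker.

```python
import sqlite3
import time


def open_queue(path=":memory:"):
    """Open (and create if needed) a hypothetical task table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " run_at REAL NOT NULL,"
        " state TEXT NOT NULL DEFAULT 'queued')"
    )
    conn.commit()
    return conn


def submit(conn, payload, run_at=None):
    """Persist a task; run_at (epoch seconds) defers it until that time."""
    run_at = time.time() if run_at is None else run_at
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO tasks (payload, run_at) VALUES (?, ?)", (payload, run_at)
        )
    return cur.lastrowid


def acquire(conn, now=None):
    """Claim the earliest due task, or return None if nothing is due."""
    now = time.time() if now is None else now
    with conn:
        row = conn.execute(
            "SELECT id, payload FROM tasks"
            " WHERE state = 'queued' AND run_at <= ?"
            " ORDER BY run_at, id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # The state check makes the claim conditional, so a task that
        # another worker has already taken is not handed out twice.
        claimed = conn.execute(
            "UPDATE tasks SET state = 'running' WHERE id = ? AND state = 'queued'",
            (row[0],),
        )
        if claimed.rowcount != 1:
            return None
    return row


def finish(conn, task_id):
    """Remove a completed task from the queue."""
    with conn:
        conn.execute("DELETE FROM tasks WHERE id = ?", (task_id,))
```

Because the queue state lives in the database rather than in any worker process, workers themselves can stay stateless and be replaced freely, which fits the maintainability goals listed above.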