熊嘉男
Track 1: Internals
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
DiDi Chuxing is China’s most popular ride-sharing company, and we use HBase whenever we have a big data problem.
We run three clusters which serve different business needs. We backported the Region Grouping feature to our internal HBase version so we could isolate the different use cases.
We built the Didi HBase Service platform which is popular amongst engineers at our company. It includes a workflow and project management function as well as a user monitoring view.
Internally we recommend users adopt Phoenix to simplify access. Beyond that, we used row timestamps and multidimensional table schemas to solve multi-dimension query problems.
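As an illustration of the row-timestamp approach (ROW_TIMESTAMP is a real Phoenix feature; the table, columns, and query below are hypothetical, not DiDi's actual schema):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class PhoenixRowTimestampExample {
    public static void main(String[] args) throws Exception {
        // Phoenix JDBC URL: jdbc:phoenix:<zookeeper quorum>
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk1,zk2,zk3:2181")) {
            // Declaring EVENT_TIME as ROW_TIMESTAMP maps it onto the native
            // HBase cell timestamp, so time-range scans can skip whole store files.
            conn.createStatement().execute(
                "CREATE TABLE IF NOT EXISTS DRIVER_EVENTS ("
              + "  EVENT_TIME TIMESTAMP NOT NULL,"
              + "  DRIVER_ID  VARCHAR NOT NULL,"
              + "  CITY       VARCHAR,"
              + "  LAT        DOUBLE,"
              + "  LON        DOUBLE,"
              + "  CONSTRAINT PK PRIMARY KEY (EVENT_TIME ROW_TIMESTAMP, DRIVER_ID))");
            // A multi-dimension query: time range plus a driver dimension.
            try (PreparedStatement ps = conn.prepareStatement(
                     "SELECT CITY, COUNT(*) FROM DRIVER_EVENTS "
                   + "WHERE EVENT_TIME > NOW() - 1 AND DRIVER_ID = ? "
                   + "GROUP BY CITY")) {
                ps.setString(1, "driver-42");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " " + rs.getLong(2));
                    }
                }
            }
        }
    }
}
```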
C++, Go, Python, and PHP clients access HBase via thrift2 proxies and the QueryServer.
We run many important business applications on our HBase clusters, such as ETA, GPS, historical orders, API metrics monitoring, and Traffic in the Cloud. If any of these interest you, please come to our talk; we would like to share our experiences with you.
Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends - Esther Kundin
An overview of the history of Big Data, followed by a deep dive into the Hadoop ecosystem. Detailed explanation of how HDFS, MapReduce, and HBase work, followed by a discussion of how to tune HBase performance. Finally, a look at industry trends, including challenges faced and being solved by Bloomberg for using Hadoop for financial data.
hbaseconasia2017: HBase Disaster Recovery Solution at Huawei - HBaseCon
Ashish Singhi
The HBase disaster recovery solution aims to keep the HBase service highly available when one HBase cluster suffers a disaster, with minimal user intervention. This session introduces HBase disaster recovery use cases and the solutions adopted at Huawei, including the following (a minimal replication-peer sketch follows the list):
a) Cluster Read-Write mode
b) DDL operations synchronization with standby cluster
c) Mutation and bulk loaded data replication
d) Further challenges and pending work
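For orientation, these solutions extend HBase's built-in replication. A minimal sketch, assuming an HBase 2.x client, of registering a standby cluster as a replication peer through the stock Admin API (not Huawei's extended implementation); the cluster key and table name are placeholders:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.replication.ReplicationPeerConfig;

public class AddStandbyPeer {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Point the peer at the standby cluster's ZooKeeper ensemble.
            ReplicationPeerConfig peer = ReplicationPeerConfig.newBuilder()
                .setClusterKey("standby-zk1,standby-zk2,standby-zk3:2181:/hbase")
                .build();
            admin.addReplicationPeer("standby", peer);
            // Create the table on the peer and mark its column families with
            // REPLICATION_SCOPE=1 so mutations start flowing to the standby.
            admin.enableTableReplication(TableName.valueOf("orders"));
        }
    }
}
```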
hbaseconasia2017 hbasecon hbase https://www.eventbrite.com/e/hbasecon-asia-2017-tickets-34935546159#
Escalando Foursquare basado en Checkins y Recomendaciones (Scaling Foursquare Based on Check-ins and Recommendations) - Manuel Vargas
1) Foursquare scaled its data storage by sharding and replicating across multiple databases as user and venue data grew significantly.
2) As the application complexity increased, Foursquare transitioned to a service-oriented architecture using Finagle for RPC but faced challenges with duplication, tracing issues, and reliability.
3) Foursquare developed common tools for builds, deploys, monitoring, tracing, and circuit breaking to help manage the increasingly distributed system and facilitate independent development of features.
2015: Deploying Flash in the Data Center - Howard Marks
Deploying Flash in the Data Center discusses various ways to deploy flash storage in the data center to improve performance. It describes all-flash arrays that provide the highest performance but also more expensive options like hybrid arrays that combine flash and disk. It also covers using flash in servers or as a cache to accelerate storage arrays. Choosing the best approach depends on factors like workload, budget, and existing infrastructure.
This document discusses using HBase for online transaction processing (OLTP) workloads. It provides background on SQL-on-Hadoop and transaction processing with snapshot isolation. It then describes challenges in adding transactions directly to HBase, including using additional system tables to coordinate transactions. Examples are given for implementing transactions in HBase, along with issues like rollback handling. Finally, it discusses using SQL interfaces like Apache Phoenix or Drill on top of HBase, as well as open questions around the future of OLTP and OLAP processing on Hadoop versus traditional databases.
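For context on why those extra coordination tables are needed: out of the box, HBase guarantees atomicity only within a single row, so anything spanning rows requires the kind of coordination described above. A minimal sketch of the single-row check-and-mutate primitive (HBase 2.x API; table and values are illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class SingleRowCas {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("accounts"))) {
            byte[] row = Bytes.toBytes("acct-1");
            byte[] cf = Bytes.toBytes("d");
            byte[] q = Bytes.toBytes("balance");
            // Atomically set the balance to 90 only if it is still 100;
            // returns false when another writer got there first.
            boolean applied = table.checkAndMutate(row, cf)
                .qualifier(q)
                .ifEquals(Bytes.toBytes("100"))
                .thenPut(new Put(row).addColumn(cf, q, Bytes.toBytes("90")));
            System.out.println("CAS applied: " + applied);
        }
    }
}
```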
Building Apps with Distributed In-Memory Computing Using Apache Geode - PivotalOpenSourceHub
Slides from the Meetup on Monday, March 7, 2016, just before the beginning of #GeodeSummit, where we cover an introduction to the technology and community that is Apache Geode, the in-memory data grid.
This document summarizes a presentation about using the HBase database with Ruby on Rails applications. It discusses what HBase is, some of the tradeoffs it involves compared to relational databases, when it may be suitable versus not suitable for an application, and how to interface with it from Rails. Examples are provided of libraries that can be used to connect Rails and HBase, as well as demos of JRuby scripts and Rails code that access an HBase backend.
Chicago Data Summit: Geo-based Content Processing Using HBase - Cloudera, Inc.
NAVTEQ uses Cloudera Distribution including Apache Hadoop (CDH) and HBase with Cloudera Enterprise support to process and store location content data. With HBase and its distributed and column-oriented architecture, NAVTEQ is able to process large amounts of data in a scalable and cost-effective way.
Before joining Couchbase, Phil was a consultant on many different node.js and NoSQL projects, working with many different languages and databases. By helping clients solve scalability problems, as well as building completely new APIs, he gained broad knowledge of the available platforms and their tradeoffs, in the big and the small. He's a Developer Evangelist for Couchbase, where he works to educate developers on the different parts of using a NoSQL database, from mobile to big-iron servers.
Innovation with Connection, The new HPCC Systems Plugins and Modules - HPCC Systems
As part of the 2018 HPCC Systems Summit Community Day event:
The HPCC Systems platform team continues to expand interoperability with third party systems, which increases the platform feature-set and facilitates custom solutions. James will share an update on the latest connectors available, including the Spark-HPCC, and the upcoming HDFS connector plugin.
James McMullan has a broad range of software engineering experience, from developing low-level system drivers for X-ray fluorescence equipment to mobile video games and web applications. He is a recent addition to the LexisNexis team and is part of the HPCC Systems platform team, where he has been working on connectors integrating HPCC Systems with the Spark and Hadoop ecosystems.
It has been just a few months since PostgreSQL 9.5 was released, and some of our customers are excited about the great new features and performance enhancements in v9.5. But here we are already taking a peek into the next version, and we find it awesome! One of the most awaited features, parallelism, makes it to Postgres. The infrastructure for parallelism has been added over the last few releases, but the first parallel operations in query execution will be seen only in v9.6.
Join Postgres experts Marc Linster and Devrim Gündüz as they provide a step by step guide for installing PostgreSQL and EDB Postgres Advanced Server on Linux.
Highlights include:
- The advantages of native packages
- An in-depth look at RPMs and DEBs
- A step-by-step demo
Operationalizing Data Science Using Cloud Foundry - VMware Tanzu
The document discusses how operationalizing machine learning models through continuous deployment and monitoring is important to realizing business value but often overlooked. It describes how Alpine Data's Chorus platform, in combination with Pivotal's Big Data Suite and Cloud Foundry, can provide a turn-key solution for operationalizing models by deploying scalable scoring engines that consume models exported in the PFA format. The platform aims to make it simple to deploy both individual models and complex scoring flows represented as PFA documents, to ensure models have maximum impact on the business.
Powering GIS Application with PostgreSQL and Postgres Plus - Ashnikbiz
This document provides an overview of Postgres Plus Advanced Server and its features. It begins with introductions to PostgreSQL and PostGIS. It then discusses Postgres Plus Advanced Server's Oracle compatibility, performance enhancements, security features, high availability options, database administration tools, and migration toolkit. The document also provides information on scaling Postgres Plus Advanced Server through partitioning and infinite cache technologies. It concludes with summaries of the replication capabilities of Postgres Plus Advanced Server.
ApacheCon Europe 2012: Operating HBase - Things you need to know - Christian Gügi
This document provides an overview of important concepts for operating HBase, including:
- HBase stores data in column families persisted as files on disk, and writes go to memory before being flushed to disk.
- Manual and automatic splitting of regions is covered, as well as challenges of improper splitting.
- Tools for monitoring, debugging, and visualizing HBase operations are discussed.
- Key lessons focus on proper data modeling, extensive monitoring, and understanding the whole Hadoop ecosystem.
Trusted advisory on technology comparison - Exadata, HANA, DB2 - Ajay Kumar Uppal
- SAP HANA is a column-oriented, in-memory database that promises performance gains of up to 100,000x over traditional databases and enables new real-time use cases. Its appliance model reduces costs by simplifying infrastructure requirements. However, it requires new extreme main memory hardware and has limitations for high availability, disaster recovery, and virtualization initially.
- Oracle Exadata is an optimized hardware and software appliance for Oracle Database that scales to hundreds of terabytes. It provides fast performance through SSD caching and compression but does not have a true column-oriented architecture. Additional products like TimesTen and Essbase are needed for optimal OLAP support.
- IBM DB2 with BLU extension provides query acceleration for OL
Apache Geode provides a database-like consistency model, reliable transaction processing and a shared-nothing architecture to maintain very low latency performance with high concurrency processing.
Asynchronous cascading master to multiple replicas
Asynchronous multi-master
Can be used for:
Improved performance for geographically dispersed users
High availability
Load distribution (OLTP vs. reporting)
Various HA and DR setups for Postgres Plus Advanced Server:
Active – Passive OS HA Clustering
Log Shipping Replication (Hot Standby Mode)
Hot Streaming Replication (Hot Standby Mode)
EDB Postgres Plus Failover Manager
HA with read scaling (with pg-pool)
xDB Single Master Replication (SMR)
xDB Multi Master Replication (MMR)
Use Cases
This document summarizes a presentation about optimizing for low latency in HBase. It discusses how to measure latency, the write and read paths in HBase, sources of latency like garbage collection and compactions, and techniques for reducing latency like streaming puts, block caching, and timeline consistency. The key points are that single puts can achieve millisecond latency while garbage collection and machine failures can cause pauses of 10s of milliseconds to seconds, and optimizing for the "magical 1%" of requests after the 99th percentile is important to improve average latency.
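Two of the client-side techniques named above can be sketched briefly; a minimal example, assuming a table named metrics with region replicas enabled (names are illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Consistency;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LowLatencyClient {
    public static void main(String[] args) throws Exception {
        TableName tn = TableName.valueOf("metrics");
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            // Streaming puts: BufferedMutator batches mutations client-side,
            // hiding per-RPC latency spikes behind a local write buffer.
            try (BufferedMutator mutator = conn.getBufferedMutator(tn)) {
                mutator.mutate(new Put(Bytes.toBytes("row1"))
                    .addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("1")));
            } // close() flushes the buffer
            // Timeline consistency: accept a possibly-stale answer from a
            // secondary region replica instead of waiting out a primary pause.
            try (Table table = conn.getTable(tn)) {
                Get get = new Get(Bytes.toBytes("row1"));
                get.setConsistency(Consistency.TIMELINE);
                Result r = table.get(get);
                System.out.println("stale=" + r.isStale());
            }
        }
    }
}
```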
HBase Accelerated introduces an in-memory flush and compaction pipeline for HBase to improve performance of real-time workloads. By keeping data in memory longer and avoiding frequent disk flushes and compactions, it reduces I/O and improves read and scan latencies. Evaluation on workloads with high update rates and small working sets showed the new approach significantly outperformed the default HBase implementation by serving most data from memory. Work is ongoing to further optimize the in-memory representation and memory usage.
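This line of work shipped as the Accordion feature in Apache HBase 2.0. A minimal sketch of enabling an in-memory compaction policy per column family via the 2.x API (table and family names are illustrative):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.MemoryCompactionPolicy;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableInMemoryCompaction {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // BASIC keeps flattened in-memory segments in a pipeline;
            // EAGER additionally merges segments and drops duplicate and
            // deleted cells in memory, trading CPU for fewer flushes and
            // on-disk compactions.
            admin.createTable(TableDescriptorBuilder.newBuilder(TableName.valueOf("events"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("d"))
                    .setInMemoryCompaction(MemoryCompactionPolicy.EAGER)
                    .build())
                .build());
        }
    }
}
```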
Training Slides: Basics 103: The Power of Tungsten Connector / Proxy - Continuent
Tungsten Connector / Proxy for MySQL is truly the secret sauce of the Tungsten Clustering solution. Join us for a basic 30-minute introduction and tour of Tungsten Connector / Proxy, and gain an understanding of the various SQL routing methods it provides.
AGENDA
- Review the cluster architecture
- Understand the role of the Connector
- Explore Connector routing methods
- Discuss user authentication
- Review configuration files and their locations
- Explore the command line interface
If you are seeking ways to improve your cloud database environment with EDB Postgres, this presentation reviews how you can create a Database-as-a-Service (DBaaS) with EDB Postgres on AWS.
This presentation outlines how EDB Ark can play a key role in your digital transformation with more agility and speed.
It highlights:
● How EDB Ark can integrate with your existing AWS environment and other clouds
● How you can automate your database deployments to instantly spin up new databases
● How to manage your database environment more easily using the same GUI for all clouds
● How to boost developer efficiency and satisfaction
Whether your database is currently in the cloud or you are considering the cloud as an option, this presentation will provide you with the information you need to evaluate EDB Postgres and EDB Ark.
The recording of this presentation includes a demonstration. Visit www.edbpostgres.com > resources > webcasts
This document discusses managing storage across public and private resources. It covers the evolution of on-site storage management, storage options in the public cloud, and challenges of managing hybrid cloud storage. Key topics include the transition from siloed storage to software-defined storage, various cloud storage services like object storage and block storage, challenges of public cloud limitations, and solutions for connecting on-site and cloud storage like gateways, file systems, and caching appliances.
This document discusses Spark streaming and Kafka for stream processing. It provides an overview of Spark streaming concepts like RDDs and DStreams. It also introduces Kafka and describes approaches for streaming data ingestion from Kafka into Spark using either receiver-based or direct approaches. The direct approach allows for exactly-once semantics through Spark checkpoints and atomic transactions. The document also discusses options for storing offsets either using Spark checkpoints or an external data store.
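A condensed sketch of the direct approach with offsets committed back to Kafka only after a batch succeeds, following the documented spark-streaming-kafka-0-10 pattern; broker, topic, and group names are placeholders:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.CanCommitOffsets;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.HasOffsetRanges;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;
import org.apache.spark.streaming.kafka010.OffsetRange;

public class DirectKafkaStream {
    public static void main(String[] args) throws Exception {
        JavaStreamingContext jssc = new JavaStreamingContext(
            new SparkConf().setAppName("direct-kafka"), Durations.seconds(5));

        Map<String, Object> kafkaParams = new HashMap<>();
        kafkaParams.put("bootstrap.servers", "broker1:9092");
        kafkaParams.put("key.deserializer", StringDeserializer.class);
        kafkaParams.put("value.deserializer", StringDeserializer.class);
        kafkaParams.put("group.id", "etl");
        kafkaParams.put("enable.auto.commit", false); // we commit after processing

        // Direct approach: executors read their assigned Kafka partitions
        // themselves; no receiver, offsets are tracked per batch.
        JavaInputDStream<ConsumerRecord<String, String>> stream =
            KafkaUtils.createDirectStream(jssc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, String>Subscribe(
                    Arrays.asList("events"), kafkaParams));

        stream.foreachRDD(rdd -> {
            OffsetRange[] offsets = ((HasOffsetRanges) rdd.rdd()).offsetRanges();
            // ... process rdd and write results transactionally ...
            // Commit offsets back to Kafka only once the batch has succeeded.
            ((CanCommitOffsets) stream.inputDStream()).commitAsync(offsets);
        });

        jssc.start();
        jssc.awaitTermination();
    }
}
```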
HBase Status Quo - ApacheCon Europe, Nov 2012 - Chris Huang
The document summarizes the status of HBase and its relationship with HDFS. In the past, HDFS did not prioritize HBase's needs, but reliability, availability, and performance have improved with Hadoop 1.0 and 2.0. Hadoop 2.0 features like HDFS high availability and wire compatibility directly benefit HBase. Further improvements planned for Hadoop 2.x like direct reads and zero-copy support could significantly boost HBase performance. The HBase project is also advancing with new versions focused on features like coprocessors and performance optimizations.
HBase Status Report - Hadoop Summit Europe 2014 - larsgeorge
This document provides a summary of new features and improvements in recent versions of Apache HBase, a distributed, scalable, big data store. It discusses major changes and enhancements in HBase 0.92+, 0.94+, and 0.96+, including new HFile formats, coprocessors, caching improvements, performance tuning, and more. The document is intended to bring readers up to date on the current state and capabilities of HBase.
The document discusses several key topics in Apache HBase:
1. Procedure version 2 introduces a new framework for running operations like create/drop table and region assignment as procedures with distinct phases.
2. Assignment Manager version 2 uses procedures and improves region assignment and load balancing.
3. Backup/restore now supports HDFS, S3, ADLS and WASB. Snapshots can also be used for backup.
4. Compacting memstore allows in-memory flushing and compaction to improve performance through pipelining.
This document summarizes a talk about Facebook's use of HBase for messaging data. It discusses how Facebook migrated data from MySQL to HBase to store metadata, search indexes, and small messages in HBase for improved scalability. It also outlines performance improvements made to HBase, such as for compactions and reads, and future plans such as cross-datacenter replication and running HBase in a multi-tenant environment.
The document discusses Facebook's use of HBase to store messaging data. It provides an overview of HBase, including its data model, performance characteristics, and how it was a good fit for Facebook's needs due to its ability to handle large volumes of data, high write throughput, and efficient random access. It also describes some enhancements Facebook made to HBase to improve availability, stability, and performance. Finally, it briefly mentions Facebook's migration of messaging data from MySQL to their HBase implementation.
The document discusses Facebook's use of HBase as the database storage engine for its messaging platform. It provides an overview of HBase, including its data model, architecture, and benefits like scalability, fault tolerance, and simpler consistency model compared to relational databases. The document also describes Facebook's contributions to HBase to improve performance, availability, and achieve its goal of zero data loss. It shares Facebook's operational experiences running large HBase clusters and discusses its migration of messaging data from MySQL to a de-normalized schema in HBase.
HBaseConAsia2018 Track1-5: Improving HBase reliability at Pinterest with geo ... - Michael Stack
This document discusses techniques used by Pinterest to improve the reliability of HBase and reduce the cost and complexity of backing up HBase data. It describes how Pinterest uses geo-replication across data centers to provide high availability of HBase clusters. It also details Pinterest's upgrade to their backup pipeline to allow direct export of HBase snapshots and write-ahead logs to Amazon S3, avoiding the need for an intermediate HDFS backup cluster. Additionally, it covers their use of an offline deduplication tool called PinDedup to further reduce S3 storage usage by identifying and replacing duplicate files across backup cycles. This combination of techniques significantly reduced infrastructure costs and backup times for Pinterest's critical HBase data.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and on Data2Day Meetup on 28.09.2017 in Heidelberg.
Speaker: Varun Sharma (Pinterest)
Over the past year, HBase has become an integral component of Pinterest's storage stack. HBase has enabled us to quickly launch and iterate on new products and create amazing pinner experiences. This talk briefly describes some of these applications, the underlying schema, and how our HBase setup stays highly available and performant despite billions of requests every week. It will also include some performance tips for running on SSDs. Finally, we will talk about a homegrown serving technology we built from a mashup of HBase components that has gained wide adoption across Pinterest.
Speed Up Your Queries with Hive LLAP Engine on Hadoop or in the Cloud - Gluent
Hive was the first popular SQL layer built on Hadoop and has long been known as a heavyweight SQL engine suitable mainly for long-running batch jobs. This has greatly changed since Hive was announced to the world over 8 years ago. Hortonworks and the open source community have evolved Apache Hive into a fast, dynamic SQL on Hadoop engine capable of running highly concurrent query workloads over large datasets with sub-second response time.
The latest Hortonworks and Azure HDInsight platform versions fully support Hive with LLAP execution engine for production use. In this webinar, we will go through the architecture of Hive + LLAP engine and explain how it differs from previous Hive versions. We will then dive deeper and show how features like query vectorization and LLAP columnar caching bring further automatic performance improvements.
In the end, we will show how Gluent brings these new performance benefits to traditional enterprise database platforms via transparent data virtualization, allowing even your largest databases to benefit from all this without changing any application code. Join this webinar to learn about significant improvements in modern Hive architecture and how Gluent and Hive LLAP on Hortonworks or Azure HDInsight platforms can accelerate cloud migrations and greatly improve hybrid query performance!
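As a small illustration of what vectorization means operationally, a hypothetical JDBC session against HiveServer2 that toggles the relevant switch (host, credentials, and table are placeholders; on LLAP-enabled platforms vectorization is typically on by default):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveLlapQuery {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; with LLAP, query fragments run in
        // long-lived daemons that keep a columnar cache warm across queries.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hs2-host:10000/default", "user", "");
             Statement st = conn.createStatement()) {
            // Vectorized execution processes batches of rows at a time
            // instead of row by row.
            st.execute("SET hive.vectorized.execution.enabled=true");
            try (ResultSet rs = st.executeQuery(
                     "SELECT city, COUNT(*) FROM trips GROUP BY city")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }
}
```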
The document discusses running Hive/Spark on S3 object storage using S3A committers, and running HBase on NFS file storage instead of HDFS. This separates compute and storage and avoids HDFS operations and complexity. S3A committers allow fast, atomic writes to S3 without renaming files. Benchmark results show the magic committer is faster than the file committer for S3 writes. HBase performance tests show FlashBlade NFS providing low latency for random reads/writes compared to Amazon EFS.
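A rough sketch of wiring a Spark job to the S3A magic committer: the fs.s3a.committer.* keys are standard Hadoop 3.x S3A settings, while the bucket name and the Spark committer bindings (from the optional spark-hadoop-cloud module) are assumptions about the setup:

```java
import org.apache.spark.sql.SparkSession;

public class MagicCommitterJob {
    public static void main(String[] args) {
        // Route S3A output through the "magic" committer: tasks write via
        // multipart uploads that are only completed at job commit, so the
        // commit is atomic and needs no rename on the object store.
        SparkSession spark = SparkSession.builder()
            .appName("s3a-magic")
            .config("spark.hadoop.fs.s3a.committer.name", "magic")
            .config("spark.hadoop.fs.s3a.committer.magic.enabled", "true")
            // Bindings that let Spark SQL use Hadoop's path output committers;
            // these classes ship in the spark-hadoop-cloud module.
            .config("spark.sql.sources.commitProtocolClass",
                    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
            .config("spark.sql.parquet.output.committer.class",
                    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
            .getOrCreate();

        spark.range(0, 1000000)
             .write()
             .parquet("s3a://example-bucket/tables/numbers/"); // placeholder bucket
        spark.stop();
    }
}
```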
Rocketfuel processes over 120 billion ad auctions per day and needs to detect fraud in real time to prevent losses. They developed Helios, which ingests event data from Kafka and HDFS into Storm in real time, joins the streams in HBase, then runs MapReduce jobs hourly to populate an OLAP cube for analyzing feature vectors and detecting fraud patterns. This architecture on Hadoop allows them to easily scale real-time processing and experiment with different configurations to quickly react to fraud.
Architectural Evolution Starting from Hadoop - SpagoWorld
Speech given by Monica Franceschini, Solution Architecture Manager at the Big Data Competency Center of Engineering Group, on the occasion of the Data Driven Innovation Rome 2016 - Open Summit.
The document provides an introduction to Hadoop. It discusses how Google developed its own infrastructure using Google File System (GFS) and MapReduce to power Google Search due to limitations with databases. Hadoop was later developed based on these Google papers to provide an open-source implementation of GFS and MapReduce. The document also provides overviews of the HDFS file system and MapReduce programming model in Hadoop.
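To make the MapReduce programming model concrete, here is the canonical word-count job, essentially as shipped with Hadoop's examples:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Map phase: runs in parallel over HDFS blocks, emitting (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();
        public void map(Object key, Text value, Context ctx)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                ctx.write(word, ONE);
            }
        }
    }
    // Reduce phase: the framework groups by key; we sum the counts.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            ctx.write(key, new IntWritable(sum));
        }
    }
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```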
With the public confession of Facebook, HBase is on everyone's lips when it comes to the discussion around the new "NoSQL" area of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase. This includes the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
This document discusses how StreamHorizon can accelerate big data analytics pipelines. It integrates seamlessly into data processing pipelines and can process data from sources like Spark, Storm, Kafka, and file systems. StreamHorizon reduces network congestion and improves query latency for tools like Impala and Hive. It is portable across platforms and provides real-time and batch processing capabilities through integrations with tools like Storm, Kafka, and Hadoop. StreamHorizon also performs data aggregations during processing to further accelerate querying and reduce network usage.
AWS re:Invent 2016: Streaming ETL for RDS and DynamoDB (DAT315) - Amazon Web Services
During this session Greg Brandt and Liyin Tang, Data Infrastructure engineers from Airbnb, will discuss the design and architecture of Airbnb's streaming ETL infrastructure, which exports data from RDS for MySQL and DynamoDB into Airbnb's data warehouse, using a system called SpinalTap. We will also discuss how we leverage Spark Streaming to compute derived data from tracking topics and/or database tables, and HBase to provide immediate data access and generate cleanly time-partitioned Hive tables.
This document discusses techniques for improving latency in HBase. It analyzes the write and read paths, identifying sources of latency such as networking, HDFS flushes, garbage collection, and machine failures. For writes, it finds that single puts can achieve millisecond latency while streaming puts can hide latency spikes. For reads, it notes cache hits are sub-millisecond while cache misses and seeks add latency. GC pauses of 25-100ms are common, and failures hurt locality and require cache rebuilding. The document outlines ongoing work to reduce GC, use off-heap memory, improve compactions and caching to further optimize for low latency.
HBaseCon 2013: Apache HBase and HDFS - Understanding Filesystem Usage in HBase - Cloudera, Inc.
This document discusses file system usage in HBase. It describes the main file types in HBase including write ahead logs (WALs), data files, and reference files. It covers topics like durability semantics, IO fencing, and data locality techniques used in HBase like short circuit reads, checksums, and block placement. The document is presented by Enis Söztutar and is intended to help understand how HBase performs IO operations over HDFS for tuning performance.
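As one concrete instance of the locality techniques mentioned, a client-side sketch of the standard HDFS short-circuit read settings (the socket path is site-specific and must match the DataNode's hdfs-site.xml):

```java
import org.apache.hadoop.conf.Configuration;

public class ShortCircuitConfig {
    public static Configuration withShortCircuitReads() {
        Configuration conf = new Configuration();
        // Let a co-located client (e.g. a RegionServer) read block files
        // directly from local disk, bypassing the DataNode's data path.
        conf.setBoolean("dfs.client.read.shortcircuit", true);
        // Unix domain socket shared by the DataNode and the client; it must
        // exist on every node and match dfs.domain.socket.path on the DataNode.
        conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
        return conf;
    }
}
```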
Similar to hbaseconasia2019 BDS: A data synchronization platform for HBase (20)
hbaseconasia2019 HBase Table Monitoring and Troubleshooting System on Cloud - Michael Stack
Long Chen
Track 3: Applications
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 Recent work on HBase at Pinterest - Michael Stack
Lianghong Xu
Track 3: Applications
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 Phoenix Practice in China Life Insurance Co., Ltd - Michael Stack
Yechao Chen
Track 3: Applications
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
TianHang Tang
Track 3: Applications
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 The Practice in trillion-level Video Storage and billion-lev... - Michael Stack
Xu Ming
Track 3: Applications
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
Andrew Cheng
Track 3: Applications
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 Spatio temporal Data Management based on Ali-HBase Ganos and... - Michael Stack
Fei Xiao of Alibaba
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 Bridging the Gap between Big Data System Software Stack and ... - Michael Stack
Huan-Ping Su (蘇桓平) and Yi-Sheng Lien (連奕盛) of National Cheng Kung University
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 Pharos as a Pluggable Secondary Index Component - Michael Stack
Lei Wang China Everbright Bank
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba - Michael Stack
Yun Zhang
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
Junhong Xu of Xiaomi
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 BigData NoSQL System: ApsaraDB, HBase and Spark - Michael Stack
Wei Li of Alibaba
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 Test-suite for Automating Data-consistency checks on HBase - Michael Stack
Pradeep S, Mallikarjun V of Flipkart
Track 1: Internals
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 Distributed Bitmap Index Solution - Michael Stack
Xingjun Hao of Huawei
Track 1: Internals
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 HBase Bucket Cache on Persistent Memory - Michael Stack
Anoop Sam John, Ramkrishna S Vasudevan, and Xu Kai of Intel
Track 1: Internals
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 The Procedure v2 Implementation of WAL Splitting and ACL - Michael Stack
Mei Yi of Xiaomi
Track 1: Internals
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 Further GC optimization for HBase 2.x: Reading HFileBlock in... - Michael Stack
Anoop Sam John of Intel and Zheng Hu of Alibaba
Track 1: Internals
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
hbaseconasia2019 HBCK2: Concepts, trends, and recipes for fixing issues in HB... - Michael Stack
The document discusses HBCK2, a tool for fixing issues in HBase 2. Some key points:
1. HBCK2 is simpler than HBCK1, with fewer fix commands and no diagnosis commands. It requires a deeper understanding of HBase internals.
2. HBCK2 commands are master-oriented and fix issues one at a time. Common issues include regions not online, stuck procedures, and tables in the wrong state.
3. Recipes are provided to fix specific issues like missing meta regions or regions in transition using HBCK2 commands like assigns and bypass.
4. HBCK2 is still a work in progress, but contributions are welcome.
Keynote given by Duo Zhang of Xiaomi and Chunhui Shen of Alibaba
Track 1: Internals
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
HBaseConAsia2018 Track3-1: Serving billions of queries in millisecond latencies - Michael Stack
This document discusses how Bloomberg uses HBase to serve billions of queries with millisecond latency. It covers HBase principles like being an ordered key-value store and providing ACID transactions. It also discusses modeling data for HBase, including dealing with data and query skew. Implementation details covered include caching, block size tuning, column families, and compaction. The overall goal is to optimize HBase for Bloomberg's low-latency data storage and retrieval needs.
4. • Does HBase support cross-version migration without downtime?
• Does HBase support data backup to OSS or other storage?
• Does HBase support replicating incremental data to MQ, ES, or Solr?
• Can incremental data be replicated from RDS to HBase?
• Can HBase data be archived to a Spark cluster for offline analysis?
• HBase high availability
…….
12. HBase full data migration
• Avoids impact on the business
  - Only accesses HDFS
  - Dynamic migration rate
  - Decoupled from HBase
• One-click migration
  - Creates tables automatically
  - Perceives region changes
  - Perceives HFile compactions
• Efficient
  - 100 MB/s (single node)
  - Higher scalability
(A rough sketch of the HDFS-only, throttled copy idea follows this list.)
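BDS itself is not shown in code here, so as a loose illustration of the "only access HDFS, decoupled from HBase" idea, here is a sketch that lists a table's files directly from the source cluster's HDFS and copies them under a byte-rate throttle. The namenodes, paths, and rate are all placeholders, and a real tool would also track region changes and HFile compactions, as the slides note:

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LocatedFileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.RemoteIterator;

public class ThrottledHFileCopy {
    static final long BYTES_PER_SEC = 100L * 1024 * 1024; // fixed here; dynamic in a real tool

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem src = FileSystem.get(URI.create("hdfs://source-nn:8020"), conf);
        FileSystem dst = FileSystem.get(URI.create("hdfs://target-nn:8020"), conf);
        // HBase lays table data out as .../data/<ns>/<table>/<region>/<cf>/<hfile>
        Path tableDir = new Path("/hbase/data/default/orders");
        RemoteIterator<LocatedFileStatus> it = src.listFiles(tableDir, true);
        while (it.hasNext()) {
            LocatedFileStatus f = it.next();
            copyThrottled(src, f.getPath(), dst,
                new Path("/hbase-import" + f.getPath().toUri().getPath()));
        }
    }

    // Copy one file, sleeping whenever we run ahead of the byte budget.
    static void copyThrottled(FileSystem src, Path from, FileSystem dst, Path to)
            throws Exception {
        byte[] buf = new byte[1 << 20];
        long start = System.nanoTime(), copied = 0;
        try (FSDataInputStream in = src.open(from);
             FSDataOutputStream out = dst.create(to, true)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                out.write(buf, 0, n);
                copied += n;
                long expectedNanos = copied * 1_000_000_000L / BYTES_PER_SEC;
                long sleep = expectedNanos - (System.nanoTime() - start);
                if (sleep > 0) Thread.sleep(sleep / 1_000_000, (int) (sleep % 1_000_000));
            }
        }
    }
}
```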
13. Data localization rate
[Diagram: a Region on a RegionServer reads one HFile locally from DataNode1 and another remotely from DataNode2 (local read vs. remote read)]
• Data migration takes the issue of data localization rates into account
• Avoids a low localization rate after data migration
20. Operation and maintenance
• BDS
  - Easy to expand
  - Easy to upgrade
  - Monitoring
  - Alarm mechanism
• HBase Replication
  - Bug fixes
  - No alarms
  - Configuration modification and system upgrades require a RegionServer restart