HBaseCon 2013: Apache HBase Operations at Pinterest

Presented by: Jeremy Carroll, Pinterest

Operations
Jeremy Carroll
Operations Engineer
HBaseCon 2013

We help people discover things they love
and inspire them to do those things…

HBase in Production
Overview
• All running on Amazon Web Services
• 5 production clusters and growing
• Mix of SSD and SATA clusters
• Billions of page views per month

With lots of patches
Designing for EC2
• CDH 4.2.x
  • HDFS-3912 • HDFS-4721 • HDFS-3703 • HDFS-9503
• HBase 0.94.7
  • HBASE-8284 • HBASE-8389 • HBASE-8434 • HBASE-7878
• One zone per cluster / no rack locality
• RegionServers - Ephemeral disk only
• Redundant clusters for availability

Configuration
Cluster Setup
• Managed splitting w/pre split tables
• Bloom filters for pretty much everything
• Manual / Rolling major compactions
• Reverse DNS on EC2
• 3 ZooKeepers in quorum
• 1 NameNode / Sec-NameNode / Master
• 1 EBS volume for fsImage / 1 Elastic IP
• 10-50 nodes per cluster
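
Managed splitting with pre-split tables means choosing the region boundaries up front instead of letting regions split under load. The sketch below is one possible way to compute evenly spaced split keys. It assumes row keys begin with a uniformly distributed hexadecimal hash prefix; that assumption, the function name `hex_split_points`, and the 8-character key width are illustrative and are not taken from the slides.

```python
def hex_split_points(num_regions, key_width=8):
    """Return num_regions - 1 evenly spaced split keys over a hex keyspace.

    Assumes (hypothetically) that row keys start with a fixed-width,
    uniformly distributed hex prefix such as a hash of the entity ID.
    """
    if num_regions < 2:
        return []  # a single region needs no split points
    keyspace = 16 ** key_width
    step = keyspace // num_regions
    # Zero-pad each boundary so keys sort lexicographically like the row keys.
    return [format(i * step, "0{}x".format(key_width))
            for i in range(1, num_regions)]


if __name__ == "__main__":
    for key in hex_split_points(4):
        print(key)
```

The resulting keys could then be supplied as the split list when the table is created. If real keys are not uniformly distributed, boundaries sampled from production data would be a better fit than even spacing.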

Fact-driven “Fry” method using Puppet
Provisioning
• User-data passed in to drive config management
• Repackaged modifications to HDFS / HBase
• Ubuntu .deb packages created with FPM
• Synced to S3, nodes configured with s3-apt plugin
• Mount + format ephemerals on boot
• Ext4 / nodiratime / nodelalloc / lazy_itable_init

Puppet Module

    # ---- HBASE MODULE ----
    class { 'hbase':
      cluster          => 'feeds_e',
      namenode         => 'ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com',
      zookeeper_quorum => 'zk1,zk2,zk3',
      hbase_site_opts  => {
        'hbase.replication'                     => true,
        'hbase.snapshot.enabled'                => true,
        'hbase.snapshot.region.timeout'         => '35000',
        'replication.sink.client.ops.timeout'   => '20000',
        'replication.sink.client.retries.number' => '3',
        'replication.source.size.capacity'      => '4194304',
        'replication.source.nb.capacity'        => '100',
        ...
      }
    }

    # ---- FACT BASED VARIABLES ----
    $hbase_heap_size = $ec2_instance_type ? {
      'hi1.4xlarge' => '24000',
      'm2.2xlarge'  => '24000',
      'm2.xlarge'   => '11480',
      'm1.xlarge'   => '11480',
      'm1.large'    => '6500',
      ...
    }

Designed for EC2
Service Monitoring
• Wounded (dying) vs Operational
• High value metrics first
• Overall health
• Alive / dead nodes
• Service up/down
• Fsck / Blocks / % Space
• Replication status
• Regions needing splits
• fsImage checkpoint
• ZooKeeper quorum
• Synthetic transactions (get / put)
• Queues (flush / compaction / rpc)
• Latency (client / filesystem)
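
A synthetic transaction check can be sketched as a put followed by a get of the same row, timed end to end. The example below is a minimal illustration, not the monitoring code used in the talk: the function name `synthetic_check`, the 250 ms threshold, the row key, and the three status labels are all assumptions. The `put` and `get` arguments stand in for whatever HBase client calls are available; a plain dictionary is used here so the sketch runs without a cluster.

```python
import time


def synthetic_check(put, get, row=b"canary-row", value=None,
                    slow_ms=250.0, clock=time.monotonic):
    """Write a marker value, read it back, and classify the service.

    Returns a dict with a status of "operational", "wounded" (working
    but slower than slow_ms), or "down" (error or wrong value read back).
    The threshold and labels are hypothetical.
    """
    if value is None:
        value = str(clock()).encode("utf-8")
    start = clock()
    try:
        put(row, value)
        read_back_ok = get(row) == value
    except Exception:
        return {"status": "down", "latency_ms": None}
    latency_ms = (clock() - start) * 1000.0
    if not read_back_ok:
        return {"status": "down", "latency_ms": latency_ms}
    status = "wounded" if latency_ms > slow_ms else "operational"
    return {"status": status, "latency_ms": latency_ms}


if __name__ == "__main__":
    # An in-memory dict stands in for an HBase table in this sketch.
    fake_table = {}
    result = synthetic_check(fake_table.__setitem__, fake_table.get)
    print(result["status"])
```

Separating "wounded" from "down" lets an alert fire on a slow but still responding cluster before it fails outright, which is the distinction the first bullet draws.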

Instrumentation
Metrics
• OpenTSDB for high cardinality metrics
• Per region stats collection
• tCollector
• RegionServer HTTP JMX
• HBase REST
• GangliaContext for hadoop-metrics
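
To illustrate how RegionServer JMX output could feed OpenTSDB through tCollector, the sketch below converts a JMX JSON payload into lines of the form `metric timestamp value tag=value`, which is the text format tCollector collectors print to standard output. The function name, the metric prefix, and the sample payload are assumptions for illustration; fetching the payload over HTTP is left out so the example runs on its own.

```python
import json


def jmx_to_tsdb_lines(jmx_json, timestamp, tags, prefix="hbase.regionserver"):
    """Turn numeric JMX bean attributes into 'metric timestamp value tags' lines."""
    tag_str = " ".join("{}={}".format(k, v) for k, v in sorted(tags.items()))
    lines = []
    for bean in json.loads(jmx_json).get("beans", []):
        for attr, value in sorted(bean.items()):
            # Skip strings and other non-numeric attributes. bool is a
            # subclass of int in Python, so it is excluded explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                continue
            lines.append("{}.{} {} {} {}".format(
                prefix, attr, timestamp, value, tag_str))
    return lines


# Illustrative payload only; real bean and attribute names vary by version.
SAMPLE = json.dumps({"beans": [{
    "name": "hadoop:service=RegionServer,name=RegionServerStatistics",
    "compactionQueueSize": 3,
    "flushQueueSize": 0,
}]})

if __name__ == "__main__":
    for line in jmx_to_tsdb_lines(SAMPLE, 1371000000,
                                  {"host": "rs1", "cluster": "feeds_e"}):
        print(line)
```

Tagging each data point with host and cluster (and, for per-region stats, table and region) is what makes the later slicing by table, RegionServer, or region possible.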

OpenTSDB
Table RegionServer Region
Slicing and Dicing

S3 + HBase Snapshots
Backups
• Full NameNode backup every 60 mins
• EBS volume as a name.dir for crash recovery
• HBase snapshots + ExportSnapshot

Additional Tuning
Solid State Clusters
• Lower the block size from 32k
• Something a lot smaller: 8-16k
• Placement groups for 10Gb networking
• Increase DFSBandwidthPerSec
• Kernel tuning for TCP
• Compaction threads
• Disk elevator to noop

Process
Planning for Launch
• Pyres queue asynchronous reads / writes
• Allows for tuning a system before it goes live
• Tuning
• Schema
• Hot spots
• Compaction
• Canary roll out to new users
• 10% -> 30% -> 80% -> 100%
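
One common way to implement a staged rollout like this is to hash each user ID into one of 100 stable buckets and enable the new path for buckets below the current percentage. The sketch below assumes that approach; the slides do not say how users were selected, and the function name `in_rollout` and the salt string are hypothetical.

```python
import hashlib

# Stage percentages taken from the rollout described above.
ROLLOUT_STAGES = (10, 30, 80, 100)


def in_rollout(user_id, percent, salt="hbase-launch"):
    """Return True if user_id falls within the first `percent` of 100 buckets.

    The bucket depends only on the salt and the user ID, so a user who is
    enabled at 10% stays enabled at every later stage.
    """
    key = "{}:{}".format(salt, user_id).encode("utf-8")
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < percent


if __name__ == "__main__":
    users = range(10000)
    for stage in ROLLOUT_STAGES:
        enabled = sum(1 for u in users if in_rollout(u, stage))
        print("{}% stage: {} of {} users enabled".format(
            stage, enabled, len(users)))
```

Because bucket assignment is deterministic, widening the rollout only adds users and never flips an existing user back to the old path, which keeps behavior consistent while the system is tuned between stages.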
