Apache HBase is a distributed data store in production today at many enterprises and sites serving large volumes of near-real-time random accesses. By supporting a wide range of production Apache HBase clusters with diverse use cases and sizes over the past year, we've noticed several new trends, learned lessons, and taken action to improve the HBase experience. We'll present aggregated root-cause statistics on resolved support tickets from the past year. Comparing these with the previous year's shows an interesting shift away from problems internal to HBase (splitting, repairs, recovery time) toward user-inflicted problems, such as poor application architecture, that can be mitigated by tuning (bulk load, read/write latencies, and compaction policies). The talk will discuss several tuning tips used for a variety of production workloads running on HBase 0.92.x/0.94.x clusters with 10s to 100s of nodes, including settings and their justification for sizing clusters, tuning bulk loads, region counts, and memory settings. We'll also discuss recently added HBase features that alleviate these problems, including improved mean time to recovery, improved predictability, and improved performance.
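The abstract lists the tuning areas but not the settings themselves. One widely used technique for the region-count and write-distribution problems it mentions is row-key salting, which prevents sequential keys from hot-spotting a single region server. A minimal sketch (the key scheme and bucket count are hypothetical illustrations, not taken from the talk):

```python
import hashlib

# Row-key salting: prefix each key with a hash-derived bucket so
# sequential keys spread across pre-split regions instead of
# hot-spotting one region server.

NUM_BUCKETS = 16  # hypothetical: one bucket per pre-split region


def hash_bucket(row_key: str) -> int:
    # A stable hash (not Python's randomized hash()) so the same key
    # always maps to the same bucket across processes.
    digest = hashlib.md5(row_key.encode()).digest()
    return digest[0] % NUM_BUCKETS


def salted_key(row_key: str) -> str:
    """Prefix the key with a stable two-digit bucket number."""
    return f"{hash_bucket(row_key):02d}-{row_key}"


# Sequential, timestamp-like keys now scatter across buckets:
keys = [salted_key(f"event-{ts}") for ts in range(1000, 1010)]
```

Scans then fan out over all buckets in parallel; the trade-off is that a single logical range scan becomes NUM_BUCKETS smaller scans.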
Upgrade Without the Headache: Best Practices for Upgrading Hadoop in Production (Cloudera, Inc.)
Walk through some of the best practices to keep in mind when it comes to upgrading your cluster, and learn how to leverage new Upgrade Wizard features in Cloudera Enterprise 5.3.
For most mission critical workloads, downtime is never an option. Any downtime can have a direct impact on revenue and lead to frantic calls in the middle of the night. For this reason, upgrading the software that powers these workloads can often be a daunting task. It can cause unpredictable issues without access to support. That’s why an enterprise-grade administration tool is crucial for running Hadoop in production. Hadoop consists of dozens of components, running across multiple machines, all with their own configurations. That can lead to a lot of complexity and uncertainty - especially when taking the upgrade plunge.
Cloudera Manager makes upgrades easy and is the only production-ready administration tool for Hadoop. Not only does Cloudera Manager feature zero-downtime rolling upgrades, but it also has a built-in Upgrade Wizard to make upgrades simple and predictable.
Impala 2.0 - The Best Analytic Database for Hadoop (Cloudera, Inc.)
A look at why SQL access in Hadoop is critical and the benefits of a native Hadoop analytic database, what’s new with Impala 2.0 and some of the recent performance benchmarks, some common Impala use cases and production customer stories, and insight into what’s next for Impala.
SQL and Machine Learning on Hadoop using HAWQ (pivotalny)
It is true to the point of being almost rhetorical to say:
"Many enterprises have adopted HDFS as the foundational layer for their data lakes. HDFS provides the flexibility to store any kind of data and, more importantly, it is infinitely scalable on commodity hardware."
But the conundrum to date has been finding a low-latency query engine for HDFS.
At Pivotal, we cracked that problem, and the answer is HAWQ, which we intend to open source this year. During this event, we will present and demo HAWQ's architecture, its powerful ANSI SQL features, and its ability to transcend traditional BI in the form of in-database analytics (machine learning).
What the Enterprise Requires - Business Continuity and Visibility (Cloudera, Inc.)
Cloudera Enterprise BDR delivers centralized disaster recovery for data and metadata, enabling you to prepare for disaster by moving data to your secondary site automatically. Cloudera Navigator 1.0 provides data governance capabilities such as verifying access privileges and auditing access to all data stored in Hadoop, which are critical for customers that are in highly regulated industries and have stringent compliance requirements.
This presentation will teach you how to:
- Centrally configure and manage replication workflows for files (HDFS) and metadata (Hive)
- Consistently meet or exceed SLAs and RTOs through simplified management and process automation
- Track access permissions and actual accesses to all data objects in Hive, HBase, and HDFS
- Answer the questions:
- Who has access to which data object(s)
- Which data objects were accessed by a user
- When was a data object accessed and by whom
- What data assets were accessed using a service
- Which device was used to access the data
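The audit questions above all reduce to grouping a single stream of access events by different fields. A toy sketch of that idea in Python (the event schema is invented for illustration and is not Navigator's actual log format):

```python
from collections import defaultdict

# Hypothetical audit events; a real audit log's schema differs.
events = [
    {"user": "alice", "object": "/data/sales", "service": "hive",  "time": "2014-01-05T10:00"},
    {"user": "bob",   "object": "/data/sales", "service": "hdfs",  "time": "2014-01-05T11:30"},
    {"user": "alice", "object": "/data/hr",    "service": "hbase", "time": "2014-01-06T09:15"},
]


def objects_accessed_by(user):
    """Which data objects were accessed by a user?"""
    return sorted({e["object"] for e in events if e["user"] == user})


def accesses_to(obj):
    """When was a data object accessed, and by whom?"""
    return [(e["time"], e["user"]) for e in events if e["object"] == obj]


def assets_by_service():
    """What data assets were accessed using each service?"""
    by_service = defaultdict(set)
    for e in events:
        by_service[e["service"]].add(e["object"])
    return dict(by_service)
```

At cluster scale the same group-bys run over the collected audit trail rather than an in-memory list, but the questions map to queries in exactly this way.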
This talk was given by Marcel Kornacker at the 11th meeting, on April 7, 2014.
Impala (impala.io) raises the bar for SQL query performance on Apache Hadoop. With Impala, you can query Hadoop data – including SELECT, JOIN, and aggregate functions – in real time to do BI-style analysis. As a result, Impala makes a Hadoop-based enterprise data hub function like an enterprise data warehouse for native Big Data.
Deep Dive - Usage of on-premises data gateway for hybrid integration scenarios (Sajith C P Nair)
Presentation delivered by Sajith C P, Integration Architect at the 2017 Global Integration Bootcamp, Bangalore.
https://www.biztalk360.com/gib2017-india/#speakers
In this session the speaker talked about the 'on-premises data gateway' as a secure, centralized gateway that can be used for accessing on-premises data from various Azure services. He took a deep dive into how it works, how to install it, and various methods to troubleshoot connectivity. He concluded the session with a few demos of its use in Azure Logic Apps, Microsoft Flow, Power Apps, and Power BI.
While you might be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not treat open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components, and so on, and finally shows you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee continuous operation of Hadoop cluster-based solutions.
The Hadoop Distributed File System is the foundational storage layer in typical Hadoop deployments. Performance and stability of HDFS are crucial to the correct functioning of applications at higher layers in the Hadoop stack. This session is a technical deep dive into recent enhancements committed to HDFS by the entire Apache contributor community. We describe real-world incidents that motivated these changes and how the enhancements prevent those problems from reoccurring. Attendees will leave this session with a deeper understanding of the implementation challenges in a distributed file system and identify helpful new metrics to monitor in their own clusters.
HBase provides many features for multi-tenancy and isolation. However, operating these features requires integration into the broader operations of a cluster. This talk will cover some methods we use at Bloomberg for multi-tenancy and discuss some HBase-Oozie integration. Of particular interest is our work on an Oozie action for secure snapshot export -- this extends the HBase security model via Oozie, allowing self-service (non-hbase user) snapshot export on secure clusters.
Key topics:
* Bloomberg's Oozie HBase export snapshot action
* Oozie coordinated time based major compactions
* How we use LDAP with HBase (and why to take care with HADOOP-12291)
* Some of our multi-tenancy setups around monitoring for SLAs
* Suggesting HBase stays the course of being "just" a datastore -- and all projects following the Unix philosophy (this has made things like our Oozie integration much easier!)
Sharing metadata across the data lake and streams (DataWorks Summit)
Traditionally systems have stored and managed their own metadata, just as they traditionally stored and managed their own data. A revolutionary feature of big data tools such as Apache Hadoop and Apache Kafka is the ability to store all data together, where users can bring the tools of their choice to process it.
Apache Hive's metastore can be used to share the metadata in the same way. It is already used by many SQL and SQL-like systems beyond Hive (e.g. Apache Spark, Presto, Apache Impala, and via HCatalog, Apache Pig). As data processing changes from only data in the cluster to include data in streams, the metastore needs to expand and grow to meet these use cases as well. There is work going on in the Hive community to separate out the metastore, so it can continue to serve Hive but also be used by a more diverse set of tools. This talk will discuss that work, with particular focus on adding support for storing schemas for Kafka messages.
Speaker
Alan Gates, Co-Founder, Hortonworks
With the advent of Hadoop comes the need for professionals skilled in Hadoop administration, making Hadoop admin skills imperative for better career, salary, and job opportunities.
LinkedIn leverages the Apache Hadoop ecosystem for its big data analytics. Steady growth of the member base at LinkedIn, along with their social activities, results in exponential growth of the analytics infrastructure. Innovations in analytics tooling lead to heavier workloads on the clusters, which generate more data, which in turn encourage innovations in tooling and more workloads. Thus, the infrastructure remains under constant growth pressure. Heterogeneous environments, with a variety of hardware and diverse workloads, make the task even more challenging.
This talk will tell the story of how we doubled our Hadoop infrastructure twice in the past two years.
• We will outline our main use cases and historical rates of cluster growth in multiple dimensions.
• We will focus on optimizations, configuration improvements, performance monitoring and architectural decisions we undertook to allow the infrastructure to keep pace with business needs.
• The topics include improvements in HDFS NameNode performance, and fine tuning of block report processing, the block balancer, and the namespace checkpointer.
• We will reveal a study on the optimal storage device for HDFS persistent journals (SATA vs. SAS vs. SSD vs. RAID).
• We will also describe the Satellite Cluster project, which allowed us to double the number of objects stored on one logical cluster by splitting an HDFS cluster into two partitions, without the use of federation and with practically no code changes.
• Finally, we will take a peek at our future goals, requirements, and growth perspectives.
SPEAKERS
Konstantin Shvachko, Sr Staff Software Engineer, LinkedIn
Erik Krogen, Senior Software Engineer, LinkedIn
Presentation: “Big Data and MicroStrategy: Building a Bridge for the Elephant”
Intelligent engineering of an agile business requires the ability to connect the vast array of requirements, technologies and data that build up over time, while avoiding the pitfalls commonly encountered on the road to giving users comprehensive, yet nimble business analytics with MicroStrategy.
The Google generation, armed with iPads and Droid phones, brings big, bold ideas about how "Big Data" will solve the new wave of business problems; traditional users know that addressing them requires more than just embracing buzzwords like "sentiment," "R," and "Hadoop." Overall success requires building a bridge between the stable, proven, mature BI solutions in place today and the disruptive new world. Enabling deeper analytics, predictive modeling, and social media analysis, in combination with scalable self-service dashboards, reporting, and analytics, is no longer an idea but a must-do.
This informative presentation describes these business challenges and how an organization leveraged the Kognitio Analytical Platform under MicroStrategy to build such a bridge.
Hortonworks Technical Workshop - Operational Best Practices Workshop (Hortonworks)
Hortonworks Data Platform is a key component of a modern data architecture. Organizations rely on HDP for mission-critical business functions and expect the system to be constantly available and performant. In this session we will cover the operational best practices for administering the Hortonworks Data Platform, including initial setup and ongoing maintenance.
Introduction to Kudu - StampedeCon 2016 (StampedeCon)
Over the past several years, the Hadoop ecosystem has made great strides in its real-time access capabilities, narrowing the gap compared to traditional database technologies. With systems such as Impala and Spark, analysts can now run complex queries or jobs over large datasets within a matter of seconds. With systems such as Apache HBase and Apache Phoenix, applications can achieve millisecond-scale random access to arbitrarily-sized datasets.
Despite these advances, some important gaps remain that prevent many applications from transitioning to Hadoop-based architectures. Users are often caught between a rock and a hard place: columnar formats such as Apache Parquet offer extremely fast scan rates for analytics, but little to no ability for real-time modification or row-by-row indexed access. Online systems such as HBase offer very fast random access, but scan rates that are too slow for large scale data warehousing workloads.
This talk will investigate the trade-offs between real-time transactional access and fast analytic performance from the perspective of storage engine internals. It will also describe Kudu, the new addition to the open source Hadoop ecosystem that fills the gap described above, complementing HDFS and HBase to provide a new option to achieve fast scans and fast random access from a single API.
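The rock-and-a-hard-place trade-off described above can be made concrete with a toy model: a column-oriented layout keeps each attribute contiguous, so an analytic scan touches only the bytes it needs, while a key-indexed row store answers point lookups directly. A simplified Python illustration (a teaching sketch, not Kudu's actual storage engine):

```python
# Toy contrast between a columnar layout (fast full-column scans)
# and a row store keyed by primary key (fast point lookups).

rows = [{"id": i, "value": i * 10, "flag": i % 2 == 0} for i in range(1000)]

# Columnar layout: one contiguous list per attribute.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# Row store: hash index from primary key to the full row.
row_index = {r["id"]: r for r in rows}


def scan_sum(col_name):
    """Analytic scan: touch a single column end to end."""
    return sum(columns[col_name])


def point_lookup(pk):
    """Random access: fetch one complete row by key."""
    return row_index[pk]
```

The columnar scan never materializes whole rows, and the point lookup never walks a column; a system like Kudu aims to serve both access patterns from one table through a single API.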
From: DataWorks Summit Munich 2017 - 20170406
Chicago Data Summit: Geo-based Content Processing Using HBase (Cloudera, Inc.)
NAVTEQ uses Cloudera Distribution including Apache Hadoop (CDH) and HBase with Cloudera Enterprise support to process and store location content data. With HBase and its distributed and column-oriented architecture, NAVTEQ is able to process large amounts of data in a scalable and cost-effective way.
Fundamentals of big data, Hadoop project design, and a case study or use case.
General planning considerations and essentials of the Hadoop ecosystem and Hadoop projects.
This will provide the basis for choosing the right Hadoop implementation, integrating Hadoop technologies, driving adoption, and creating an infrastructure.
Building applications using Apache Hadoop, with a use case of Wi-Fi log analysis as a real-life example.
Multi-Tenant Operations with Cloudera 5.7 & BT (Cloudera, Inc.)
One benefit of Apache Hadoop is the ability to power multiple workloads, across many different users and departments, all within a single, shared cluster. Hear how BT is doing this today and learn about new features in Cloudera Manager to provide better visibility for multi-tenant operations.
Pivotal: Hadoop for Powerful Processing of Unstructured Data for Valuable Ins... (EMC)
Pivotal has set up and operationalized a 1000-node Hadoop cluster called the Analytics Workbench. It takes special setup and skills to manage such a large deployment. This session shares how we set it up and how to manage it.
After this session you will be able to:
Objective 1: Understand what it takes to operationalize a 1000-node Hadoop cluster.
Objective 2: Understand how to set up and manage the day-to-day challenges of a large Hadoop deployment.
Objective 3: Have a view of the tools that are necessary to solve the challenges of managing a large Hadoop cluster.
Hitachi Data Systems Hadoop Solution. Customers are seeing exponential growth of unstructured data, from their social media websites to operational sources. Their enterprise data warehouses are not designed to handle such high volumes and varieties of data. Hadoop, the software platform that scales to process massive volumes of unstructured and semi-structured data by distributing the workload across clusters of servers, gives customers a new option to tackle data growth and deploy big data analysis to help better understand their business. Hitachi Data Systems is launching its latest Hadoop reference architecture, pre-tested with the Cloudera Hadoop distribution, to provide a faster time to market for customers deploying Hadoop applications. HDS, Cloudera, and Hitachi Consulting will present together and explain how to get there. Attend this WebTech and learn how to:
- Solve big-data problems with Hadoop
- Deploy Hadoop in your data warehouse environment to better manage your unstructured and structured data
- Implement Hadoop using the HDS Hadoop reference architecture
For more information on the Hitachi Data Systems Hadoop Solution, please read our blog: http://blogs.hds.com/hdsblog/2012/07/a-series-on-hadoop-architecture.html
Analysis of historical movie data by BHADRABhadra Gowdra
A recommendation system understands a person's taste and automatically finds new, desirable content for them based on the patterns in their likes and ratings of different items. In this paper, we propose a recommendation system, built on the Hadoop framework, for the large amount of data available on the web in the form of ratings, reviews, opinions, complaints, remarks, feedback, and comments about any item (product, event, individual, or service).
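As a minimal sketch of the idea (in plain Python rather than the paper's Hadoop pipeline), unseen items can be scored by similarity-weighted ratings from other users. The ratings data and the cosine-similarity choice below are illustrative assumptions, not the paper's actual method.

```python
# A toy user-based recommendation sketch: score items a user has not
# rated by the similarity-weighted ratings of other users.
from math import sqrt

ratings = {  # user -> {item: rating} (illustrative data)
    "alice": {"matrix": 5, "inception": 4, "titanic": 1},
    "bob":   {"matrix": 4, "inception": 5},
    "carol": {"titanic": 5, "inception": 2},
}

def cosine(u, v):
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(user, k=1):
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for item, r in their.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("bob"))  # ['titanic'] — the only item bob hasn't rated
```

In the paper's setting, the same scoring step would be distributed over MapReduce rather than run in memory.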
Hbase in action - Chapter 09: Deploying HBasephanleson
Learning HBase, Real-time Access to Your Big Data, Data Manipulation at Scale, Big Data, Text Mining, HBase, Deploying HBase
Introduction: This workshop will provide a hands-on introduction to Machine Learning (ML) with an overview of Deep Learning (DL).
Format: An introductory lecture on several supervised and unsupervised ML techniques, followed by a light introduction to DL and a short discussion of the current state of the art. Several Python code samples using the scikit-learn library will be introduced that users will be able to run in the Cloudera Data Science Workbench (CDSW).
Objective: To provide a quick and short hands-on introduction to ML with python’s scikit-learn library. The environment in CDSW is interactive and the step-by-step guide will walk you through setting up your environment, to exploring datasets, training and evaluating models on popular datasets. By the end of the crash course, attendees will have a high-level understanding of popular ML algorithms and the current state of DL, what problems they can solve, and walk away with basic hands-on experience training and evaluating ML models.
Prerequisites: For the hands-on portion, registrants must bring a laptop with a Chrome or Firefox web browser. The labs are run in the cloud; no installation is needed. Everyone will be able to register and start using CDSW after the introductory lecture concludes (about 1 hour in). Basic knowledge of Python is highly recommended.
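A flavor of the hands-on portion can be sketched with scikit-learn's standard train/evaluate loop. The dataset and model below are illustrative choices, not necessarily the workshop's exact labs.

```python
# A minimal supervised-learning example with scikit-learn:
# load a popular dataset, split it, train a model, evaluate it.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=500).fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

The same pattern (fit on a training split, score on a held-out split) applies to the other estimators the lecture covers.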
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
In a world with a myriad of distributed storage systems to choose from, the majority of Apache HBase clusters still rely on Apache HDFS. Theoretically, any distributed file system could be used by HBase. One major reason HDFS predominates is the specific durability requirements of HBase's write-ahead log (WAL), which HDFS guarantees correctly. However, with sufficient effort, HBase's use of HDFS for WALs can be replaced.
This talk will cover the design of a "Log Service" which can be embedded inside of HBase that provides a sufficient level of durability that HBase requires for WALs. Apache Ratis (incubating) is a library-implementation of the RAFT consensus protocol in Java and is used to build this Log Service. We will cover the design choices of the Ratis Log Service, comparing and contrasting it to other log-based systems that exist today. Next, we'll cover how the Log Service "fits" into HBase and the necessary changes to HBase which enable this. Finally, we'll discuss how the Log Service can simplify the operational burden of HBase.
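The core durability rule RAFT gives such a Log Service can be shown in a few lines: an appended WAL entry only counts as durable once a strict majority of the replica group has persisted it. This toy model illustrates the quorum rule only; it is not Apache Ratis code.

```python
# A toy majority-quorum durability check, modeled on the RAFT commit rule.
def is_durable(acks, cluster_size):
    """An entry is committed once a strict majority has persisted it."""
    return acks >= cluster_size // 2 + 1

def append_entry(replicas, entry):
    # Simulate appending to each replica; a down replica is None.
    acks = 0
    for log in replicas:
        if log is not None:
            log.append(entry)
            acks += 1
    return is_durable(acks, len(replicas))

logs = [[], [], None]                   # 3-node group, one node down
print(append_entry(logs, "put row1"))   # True: 2 of 3 acknowledged
```

With two of three nodes down, the same append would return False and the write could not be acknowledged to the client, which is exactly the guarantee HBase needs from its WAL storage.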
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
Utilizing Apache NiFi, we read various open-data REST APIs and camera feeds to ingest crime and related data, streaming it in real time into HBase and Phoenix tables. HBase makes an excellent storage option for our real-time time-series data sources. We can immediately query our data utilizing Apache Zeppelin against Phoenix tables, as well as Hive external tables mapped to HBase.
Apache Phoenix tables also make a great option since we can easily put microservices on top of them for application usage. I have an example Spring Boot application that reads from our Philadelphia crime table for front-end web applications as well as RESTful APIs.
Apache NiFi makes it easy to push records with schemas to HBase and insert into Phoenix SQL tables.
Resources:
https://community.hortonworks.com/articles/54947/reading-opendata-json-and-storing-into-phoenix-tab.html
https://community.hortonworks.com/articles/56642/creating-a-spring-boot-java-8-microservice-to-read.html
https://community.hortonworks.com/articles/64122/incrementally-streaming-rdbms-data-to-your-hadoop.html
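As an illustration of why HBase suits real-time time-series sources like these, a common row-key design spreads writes across regions with a salt prefix and sorts the newest events first via a reversed timestamp. The bucket count and field names below are hedged assumptions for illustration, not the articles' actual schema.

```python
# A sketch of a salted, newest-first row key for time-series events.
import struct

SALT_BUCKETS = 8
MAX_LONG = 2**63 - 1

def row_key(incident_id: str, epoch_millis: int) -> bytes:
    salt = hash(incident_id) % SALT_BUCKETS     # spreads write load
    reversed_ts = MAX_LONG - epoch_millis       # newest sorts first
    return bytes([salt]) + struct.pack(">q", reversed_ts) + incident_id.encode()

k1 = row_key("DC-1001", 1_560_000_000_000)
k2 = row_key("DC-1001", 1_560_000_100_000)
print(k2 < k1)  # True: the later event sorts before the earlier one
```

A scan from the start of a salt bucket then returns the most recent incidents first, which matches the dashboard-style queries described above.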
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
Whilst HBase is the most logical answer for use cases requiring random, realtime read/write access to Big Data, applications that make the most of it are not trivial to design, nor is it the simplest system to operate. Because it depends on and integrates with other components of the Hadoop ecosystem (ZooKeeper, HDFS, Spark, Hive, etc.) and external systems (Kerberos, LDAP), and because its distributed nature requires infrastructure that runs like Swiss clockwork, many variables must be considered when observing anomalies or even outages. Adding to the equation, HBase is still an evolving product, with different release versions in concurrent use, some of which carry genuine software bugs. In this presentation, we'll go through the most common HBase issues faced by different organisations, describing the identified cause and resolution action for each, drawn from my last 5 years supporting HBase for our heterogeneous customer base.
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
LocationTech GeoMesa enables spatial and spatiotemporal indexing and queries for HBase and Accumulo. In this talk, after an overview of GeoMesa’s capabilities in the Cloudera ecosystem, we will dive into how GeoMesa leverages Accumulo’s Iterator interface and HBase’s Filter and Coprocessor interfaces. The goal will be to discuss both what spatial operations can be pushed down into the distributed database and also how the GeoMesa codebase is organized to allow for consistent use across the two database systems.
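The pushdown idea can be illustrated with a toy filter: rather than returning every row for the client to sift through, the spatial predicate runs beside the data, as an HBase Filter or Accumulo Iterator would, and only matches are emitted. The row shape and predicate below are assumptions for illustration, not GeoMesa code.

```python
# A toy server-side bounding-box filter over (id, x, y) rows.
def bbox_filter(rows, min_x, min_y, max_x, max_y):
    """Simulate a pushed-down spatial predicate: yield only matching ids."""
    for rid, x, y in rows:
        if min_x <= x <= max_x and min_y <= y <= max_y:
            yield rid

rows = [("a", 1.0, 1.0), ("b", 5.0, 5.0), ("c", 2.5, 2.0)]
print(list(bbox_filter(rows, 0, 0, 3, 3)))  # ['a', 'c']
```

The payoff of pushdown is that only the two matching ids cross the network instead of all three rows, and the saving grows with table size.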
OCLC has been using HBase since 2012 to enable single-search-box access to over a billion items from your library and the world's library collection. This talk will provide an overview of how HBase is structured to serve this information, some of the challenges encountered in scaling to support the world catalog, and how they were overcome.
Many individuals and organizations have a desire to utilize NoSQL technology, but often lack an understanding of how the underlying functional bits can be utilized to enable their use case. This situation can result in a drastically increased desire to put the SQL back in NoSQL.
Since the initial commit, Apache Accumulo has provided a number of examples to help jumpstart comprehension of how some of these bits function as well as potentially help tease out an understanding of how they might be applied to a NoSQL friendly use case. One very relatable example demonstrates how Accumulo could be used to emulate a filesystem (dirlist).
In this session we will walk through the dirlist implementation. Attendees should come away with an understanding of the supporting table designs, a simple text search supporting a single wildcard (on file/directory names), and how the dirlist elements work together to accomplish its feature set. Attendees should (hopefully) also come away with a justification for sometimes keeping the SQL out of NoSQL.
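As a rough preview of the table design the session walks through, a toy version keys each entry by (depth, path) so that the children of a directory form one contiguous, scannable key range in a sorted table. The key encoding below is an illustrative assumption, not the exact dirlist layout.

```python
# A toy dirlist: rows keyed by zero-padded depth + path sort so that
# listing a directory is a contiguous range scan over a sorted table.
from bisect import bisect_left, bisect_right

def key(path):
    depth = path.rstrip("/").count("/")
    return f"{depth:03d}{path}"

table = sorted(key(p) for p in [
    "/", "/etc", "/home", "/home/ann", "/home/bob", "/home/ann/notes.txt",
])

def list_children(parent):
    depth = parent.rstrip("/").count("/") + 1
    prefix = f"{depth:03d}{parent.rstrip('/')}/"
    lo = bisect_left(table, prefix)
    hi = bisect_right(table, prefix + "\xff")
    return [row[3:] for row in table[lo:hi]]   # strip the depth prefix

print(list_children("/home"))  # ['/home/ann', '/home/bob']
```

Because siblings share a key prefix, a key-value store like Accumulo answers "list this directory" with one range scan and no secondary index, which is the heart of the dirlist design.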
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
Data serves as the platform for decision-making at Uber. To facilitate data driven decisions, many datasets at Uber are ingested in a Hadoop Data Lake and exposed to querying via Hive. Analytical queries joining various datasets are run to better understand business data at Uber.
Data ingestion, at its most basic form, is about organizing data to balance efficient reading and writing of newer data. Data organization for efficient reading involves factoring in query patterns to partition data to ensure read amplification is low. Data organization for efficient writing involves factoring the nature of input data - whether it is append only or updatable.
At Uber we ingest terabytes into many critical tables, such as trips, that are updatable. These tables are a fundamental part of Uber's data-driven solutions and act as the source of truth for all analytical use cases across the entire company. Datasets such as trips constantly receive updates in addition to inserts. To ingest such datasets we need a critical component that keeps bookkeeping information about the data layout and annotates each incoming change with the location in HDFS where the data should be written. This component is called Global Indexing. Without it, all records are treated as inserts and re-written to HDFS instead of being updated, which duplicates data and breaks correctness for user queries. This component is key to scaling our jobs, which now handle greater than 500 billion writes a day in our ingestion systems, and it must provide strong consistency and high throughput for index writes and reads.
At Uber, we chose HBase as the backing store for the Global Indexing component, and it is critical in allowing us to scale our jobs to more than 500 billion writes a day in our current ingestion systems. In this talk, we will discuss data at Uber and expound on why we built the global index using Apache HBase and how it helps scale out our cluster usage. We'll give details on why we chose HBase over other storage systems, how and why we came up with a creative solution to load HFiles directly into the backend, circumventing the normal write path when bootstrapping our ingestion tables to avoid QPS constraints, as well as other lessons learned bringing this system into production at the scale of data that Uber encounters daily.
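The bookkeeping role described above can be modeled in a few lines: look up each incoming record in the index, tag it as an update to its existing file location or as an insert to a new one, and remember first-seen locations. This is a deliberately simplified model of the annotation step, not Uber's implementation.

```python
# A toy Global Index: maps record keys to their HDFS file location and
# classifies each incoming change as an insert or an update.
index = {}  # record_key -> file location (the "bookkeeping" store)

def annotate(record_key, next_file):
    if record_key in index:
        return ("update", index[record_key])   # rewrite the existing file
    index[record_key] = next_file              # first sight: a new insert
    return ("insert", next_file)

print(annotate("trip-42", "part-0001"))  # ('insert', 'part-0001')
print(annotate("trip-42", "part-0002"))  # ('update', 'part-0001')
```

In production the dict is replaced by HBase, which supplies the strong consistency and read/write throughput the abstract calls out.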
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
Recently, Apache Phoenix has been integrated with the Apache Omid (incubating) transaction processing service to provide ultra-high system throughput with ultra-low latency overhead. Phoenix has been shown to scale beyond 0.5M transactions per second with sub-5ms latency for short transactions on industry-standard hardware. On the other hand, Omid has been extended to support secondary indexes, multi-snapshot SQL queries, and massive-write transactions.
These innovative features make Phoenix an excellent choice for translytics applications, which allow converged transaction processing and analytics. We share the story of building the next-gen data tier for advertising platforms at Verizon Media that exploits Phoenix and Omid to support multi-feed real-time ingestion and AI pipelines in one place, and discuss the lessons learned.
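The snapshot-isolation model underlying Omid can be sketched in miniature: every write is versioned by its commit timestamp, and a reader at a given snapshot sees only the latest version committed at or before that snapshot. This toy model is illustrative only and is not Omid's API.

```python
# A toy multi-version store with snapshot reads.
store = {}  # key -> list of (commit_ts, value)

def write(key, value, commit_ts):
    store.setdefault(key, []).append((commit_ts, value))

def read(key, snapshot_ts):
    """Return the newest value committed at or before snapshot_ts."""
    visible = [(ts, v) for ts, v in store.get(key, []) if ts <= snapshot_ts]
    return max(visible)[1] if visible else None

write("balance", 100, commit_ts=5)
write("balance", 80, commit_ts=9)
print(read("balance", snapshot_ts=7))   # 100: the ts=9 write is invisible
print(read("balance", snapshot_ts=12))  # 80
```

This "read your snapshot, not the latest write" behavior is what lets analytics queries run consistently alongside a heavy transactional write stream.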
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
Cybersecurity requires an organization to collect data, analyze it, and alert on cyber anomalies in near real-time. This is a challenging endeavor when considering the variety of data sources which need to be collected and analyzed. Everything from application logs, network events, authentications systems, IOT devices, business events, cloud service logs, and more need to be taken into consideration. In addition, multiple data formats need to be transformed and conformed to be understood by both humans and ML/AI algorithms.
To solve this problem, the Aetna Global Security team developed the Unified Data Platform based on Apache NiFi, which allows them to remain agile and adapt to new security threats and the onboarding of new technologies in the Aetna environment. The platform currently has over 60 different data flows with 95% doing real-time ETL and handles over 20 billion events per day. In this session learn from Aetna’s experience building an edge to AI high-speed data pipeline with Apache NiFi.
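A tiny sketch of the conform step such flows perform: events from different sources are mapped onto one common shape before analysis. The source names, fields, and target schema below are illustrative assumptions, not Aetna's actual formats.

```python
# A toy normalizer: map heterogeneous security events to one schema.
def normalize(event):
    if event.get("source") == "auth":
        return {"user": event["uname"], "action": "login",
                "ok": event["result"] == "SUCCESS"}
    if event.get("source") == "cloud":
        return {"user": event["principal"], "action": event["op"],
                "ok": event["status"] < 400}
    raise ValueError("unknown source")

raw = [
    {"source": "auth", "uname": "kim", "result": "SUCCESS"},
    {"source": "cloud", "principal": "svc-api", "op": "PutObject", "status": 403},
]
print([normalize(e) for e in raw])
```

In NiFi this mapping would live in record readers/writers and processors rather than one function, but the conform-to-one-schema idea is the same.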
In the healthcare sector, data security, governance, and quality are crucial for maintaining patient privacy and ensuring the highest standards of care. At Florida Blue, the leading health insurer of Florida serving over five million members, there is a multifaceted network of care providers, business users, sales agents, and other divisions relying on the same datasets to derive critical information for multiple applications across the enterprise. However, maintaining consistent data governance and security for protected health information and other extended data attributes has always been a complex challenge that did not easily accommodate the wide range of needs for Florida Blue’s many business units. Using Apache Ranger, we developed a federated Identity & Access Management (IAM) approach that allows each tenant to have their own IAM mechanism. All user groups and roles are propagated across the federation in order to determine users’ data entitlement and access authorization; this applies to all stages of the system, from the broadest tenant levels down to specific data rows and columns. We also enabled audit attributes to ensure data quality by documenting data sources, reasons for data collection, date and time of data collection, and more. In this discussion, we will outline our implementation approach, review the results, and highlight our “lessons learned.”
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Airbnb, Bloomberg, Comcast, Facebook, FINRA, LinkedIn, Lyft, Netflix, Twitter, and Uber, in the last few years Presto experienced an unprecedented growth in popularity in both on-premises and cloud deployments over Object Stores, HDFS, NoSQL and RDBMS data stores.
With the ever-growing list of connectors to new data sources such as Azure Blob Storage, Elasticsearch, Netflix Iceberg, Apache Kudu, and Apache Pulsar, the recently introduced Cost-Based Optimizer in Presto must account for heterogeneous inputs with differing and often incomplete data statistics. This talk will explore this topic in detail, as well as discuss the best use cases for Presto across several industries. In addition, we will present recent Presto advancements such as geospatial analytics at scale and the project roadmap going forward.
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
Specialized tools for machine learning development and model governance are becoming essential. MLflow is an open source platform for managing the machine learning lifecycle. Just by adding a few lines of code to the function or script that trains their model, data scientists can log parameters, metrics, artifacts (plots, miscellaneous files, etc.) and a deployable packaging of the ML model. Every time that function or script is run, the results are logged automatically as a byproduct of those lines of code, even if the party doing the training run makes no special effort to record them. MLflow application programming interfaces (APIs) are available for the Python, R and Java programming languages, and MLflow sports a language-agnostic REST API as well. Over a relatively short time period, MLflow has garnered more than 3,300 stars on GitHub, almost 500,000 monthly downloads and 80 contributors from more than 40 companies. Most significantly, more than 200 companies are now using MLflow. We will demo the MLflow Tracking, Project and Model components with Azure Machine Learning (AML) Services and show you how easy it is to get started with MLflow on-prem or in the cloud.
Extending Twitter's Data Platform to Google CloudDataWorks Summit
Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, and various tools and libraries that help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extend our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we deep-dive into in this presentation.
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
At Comcast, our team has been architecting a customer experience platform which is able to react to near-real-time events and interactions and deliver appropriate and timely communications to customers. By combining the low latency capabilities of Apache Flink and the dataflow capabilities of Apache NiFi we are able to process events at high volume to trigger, enrich, filter, and act/communicate to enhance customer experiences. Apache Flink and Apache NiFi complement each other with their strengths in event streaming and correlation, state management, command-and-control, parallelism, development methodology, and interoperability with surrounding technologies. We will trace our journey from starting with Apache NiFi over three years ago and our more recent introduction of Apache Flink into our platform stack to handle more complex scenarios. In this presentation we will compare and contrast which business and technical use cases are best suited to which platform and explore different ways to integrate the two platforms into a single solution.
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
Companies are increasingly moving to the cloud to store and process data. One of the challenges companies have is in securing data across hybrid environments with easy way to centrally manage policies. In this session, we will talk through how companies can use Apache Ranger to protect access to data both in on-premise as well as in cloud environments. We will go into details into the challenges of hybrid environment and how Ranger can solve it. We will also talk through how companies can further enhance the security by leveraging Ranger to anonymize or tokenize data while moving into the cloud and de-anonymize dynamically using Apache Hive, Apache Spark or when accessing data from cloud storage systems. We will also deep dive into the Ranger’s integration with AWS S3, AWS Redshift and other cloud native systems. We will wrap it up with an end to end demo showing how policies can be created in Ranger and used to manage access to data in different systems, anonymize or de-anonymize data and track where data is flowing.
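The tokenize/de-tokenize pattern mentioned above can be sketched with a keyed hash plus a reversal vault: tokens are deterministic (so joins still work on tokenized data) and reversible only through the vault. The key handling and token format here are illustrative assumptions; Ranger's actual masking and policy machinery differ.

```python
# A toy deterministic tokenizer with a vault for de-anonymization.
import hmac, hashlib

KEY = b"demo-secret"          # in practice, a managed secret
vault = {}                    # token -> original value

def tokenize(value: str) -> str:
    token = hmac.new(KEY, value.encode(), hashlib.sha256).hexdigest()[:16]
    vault[token] = value
    return token

def detokenize(token: str) -> str:
    return vault[token]

t = tokenize("123-45-6789")
print(t != "123-45-6789", detokenize(t))  # True 123-45-6789
```

Because the same input always yields the same token, tokenized columns remain joinable in the cloud, while de-tokenization stays gated behind access to the vault — the dynamic-policy part is what Ranger governs.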
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
Advanced Big Data processing frameworks have been proposed to harness the fast data transmission capability of Remote Direct Memory Access (RDMA) over high-speed networks such as InfiniBand, RoCEv1, RoCEv2, iWARP, and OmniPath. However, with the introduction of Non-Volatile Memory (NVM) and NVM Express (NVMe) based SSDs, these designs, along with the default Big Data processing models, need to be re-assessed to discover the possibilities of further enhanced performance. In this talk, we will present NRCIO, a high-performance communication runtime for non-volatile memory over modern network interconnects that can be leveraged by existing Big Data processing middleware. We will show the performance of non-volatile memory-aware RDMA communication protocols using our proposed runtime and demonstrate its benefits by incorporating it into a high-performance in-memory key-value store, Apache Hadoop, Tez, Spark, and TensorFlow. Evaluation results illustrate that NRCIO can achieve up to 3.65x performance improvement for representative Big Data processing workloads on modern data centers.
Background: Some early applications of Computer Vision in Retail arose from e-commerce use cases - but increasingly, it is being used in physical stores in a variety of new and exciting ways, such as:
● Optimizing merchandising execution, in-stocks and sell-thru
● Enhancing operational efficiencies, enable real-time customer engagement
● Enhancing loss prevention capabilities, response time
● Creating frictionless experiences for shoppers
Abstract: This talk will cover the use of Computer Vision in Retail, the implications to the broader Consumer Goods industry and share business drivers, use cases and benefits that are unfolding as an integral component in the remaking of an age-old industry.
We will also take a ‘peek under the hood’ of Computer Vision and Deep Learning, sharing technology design principles and skill set profiles to consider before starting your CV journey.
Deep learning has matured considerably in the past few years to produce human or superhuman abilities in a variety of computer vision paradigms. We will discuss ways to recognize these paradigms in retail settings, collect and organize data to create actionable outcomes with the new insights and applications that deep learning enables.
We will cover the basics of object detection, then move into the advanced processing of images, describing possible ways that a retail store of the near future could operate: a deep learning system attached to a camera stream identifying various storefront situations, such as item stocks on shelves, a shelf in need of organization, or perhaps a wandering customer in need of assistance.
We will also cover how to use a computer vision system to automatically track customer purchases to enable a streamlined checkout process, and how deep learning can power plausible wardrobe suggestions based on what a customer is currently wearing or purchasing.
Finally, we will cover the various technologies that are powering these applications today. Deep learning tools for research and development. Production tools to distribute that intelligence to an entire inventory of all the cameras situation around a retail location. Tools for exploring and understanding the new data streams produced by the computer vision systems.
By the end of this talk, attendees should understand the impact Computer Vision and Deep Learning are having in the Consumer Goods industry, key use cases, techniques and key considerations leaders are exploring and implementing today.
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
Whole-genome shotgun based next-generation transcriptomics and metagenomics studies often generate 100 to 1,000 gigabytes (GB) of sequence data derived from tens of thousands of different genes or microbial species. De novo assembling these data requires a solution that both scales with data size and optimizes for individual genes or genomes. Here we developed an Apache Spark-based scalable sequence clustering application, SparkReadClust (SpaRC), that partitions reads based on their molecule of origin to enable downstream assembly optimization. SpaRC produces high clustering performance on transcriptomics and metagenomics test datasets from both short-read and long-read sequencing technologies. It achieves near-linear scalability with respect to input data size and number of compute nodes. SpaRC can run on different cloud computing environments without modification while delivering similar performance. In summary, our results suggest SpaRC provides a scalable solution for clustering billions of reads from next-generation sequencing experiments, and that Apache Spark represents a cost-effective solution with rapid development/deployment cycles for similar big data genomics problems.
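The partitioning idea — grouping reads that share k-mers, since they likely originate from the same molecule — can be shown in miniature. SpaRC does this at scale in Spark; this pure-Python sketch with made-up reads is only illustrative.

```python
# A toy read-clustering sketch: reads sharing any k-mer join one cluster.
def kmers(read, k=4):
    return {read[i:i + k] for i in range(len(read) - k + 1)}

def cluster(reads, k=4):
    clusters = []  # list of (kmer_set, member_reads)
    for r in reads:
        ks = kmers(r, k)
        for seen, members in clusters:
            if seen & ks:          # shared k-mer -> same molecule (heuristic)
                members.append(r)
                seen |= ks
                break
        else:
            clusters.append((set(ks), [r]))
    return [members for _, members in clusters]

reads = ["ACGTACGT", "TACGTTTT", "GGGGCCCC"]
print(cluster(reads))  # first two share 'TACG'; the third stands alone
```

Each cluster can then be assembled independently, which is what makes the downstream per-gene/per-genome optimization tractable.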
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
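The notion of semantics as "predictable inference" can be made concrete with link prediction over a tiny knowledge graph: once subClassOf carries an actual semantics (here, transitivity), new links follow predictably from the rule rather than from pattern matching alone. The triples below are illustrative.

```python
# A toy forward-chaining rule: subClassOf is transitive, so new links
# are *predictable* consequences of the semantics, not learned guesses.
triples = {
    ("cat", "subClassOf", "mammal"),
    ("mammal", "subClassOf", "animal"),
}

def infer(kb):
    kb = set(kb)
    changed = True
    while changed:                      # apply the rule to a fixpoint
        changed = False
        for a, p1, b in list(kb):
            for c, p2, d in list(kb):
                if p1 == p2 == "subClassOf" and b == c:
                    t = (a, "subClassOf", d)
                    if t not in kb:
                        kb.add(t)
                        changed = True
    return kb

print(("cat", "subClassOf", "animal") in infer(triples))  # True
```

A neuro-symbolic system trained over structures without such a semantics has no comparable guarantee that this link will be derived, which is the talk's point.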
DevOps and Testing slides at DASA ConnectKari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps looks like. We finished with a lovely workshop in which participants tried to find different ways to think about quality and testing in the various parts of the DevOps infinity loop.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
UiPath Test Automation using UiPath Test Suite series, part 3
Trends in Supporting Production Apache HBase Clusters
1. Trends in Supporting Production Apache HBase Clusters
Jonathan Hsieh | @jmhsieh | Software Engineer at Cloudera / HBase PMC Member
Kevin O’Dell | kevin.odell@cloudera | Systems Engineer at Cloudera
June 26, 2013
2. Who are we?
Jonathan Hsieh
• Cloudera: Software Engineer
• Apache HBase committer / PMC member
• Apache Flume founder
Kevin O’Dell
• Cloudera: Systems Engineer
• Apache HBase contributor
• Cloudera HBase Support Lead
6/26/13 Hadoop Summit / O'Dell, Hsieh
3. What is Apache HBase?
Apache HBase is a reliable, column-oriented data store that provides consistent, low-latency, random read/write access.
[Diagram: HBase alongside ZK, HDFS, MR, and applications]
4. HBase Architecture
• HBase is designed to be fault tolerant and highly available, and it depends on the systems beneath it to be as well.
• Replication for fault tolerance
• Regions can be served from any RegionServer
• Failover HMasters
• ZK quorums
• HDFS block replication on DataNodes
5. From the trenches at Cloudera Customer Operations
Trends in Supporting HBase
6. Customers in 2011-12 vs. 2012-13
0.90.x / CDH3 era:
• Red Hat 5.x
• Java JVM 1.6.13
• 4-8 disk machines
• 24-48 GB RAM
• Dual 4-core HT CPUs
• CDH3: Apache HBase 0.90, Apache Hadoop 0.20.x
0.92.x/0.94.x / CDH4 era:
• Red Hat 6.x
• Java JVM 1.6.31
• 12-15 disk machines
• 48-96 GB RAM
• Dual 6-core HT CPUs
• CDH4: Apache HBase 0.92/0.94, Apache Hadoop 2.0
7. Support Incidents 6/2011-6/2012
• Patched bug: patch delivered, or fixed in the next version
• Operational workaround: misconfiguration, schema design / tuning, hbck used to fix
• Network/HW/OS: problems with underlying systems
[Pie chart, 6/11-6/12 CDH3 / 0.90.x HBase support tickets: Workaround (config) 44%, Workaround (hbck) 28%, Net/HW/OS 16%, Patched 12%]
8. Comparing 6/11-6/12 to 6/12-6/13
[Pie chart, 6/11-6/12 CDH3 / 0.90.x HBase support tickets: Workaround (config) 44%, Workaround (hbck) 28%, Net/HW/OS 16%, Patched 12%]
[Pie chart, 6/12-6/13 CDH3+CDH4 HBase support tickets: Net/HW/OS 42%, Workaround (config/hbck) 36%, Patched 14%, Documentation 8%]
Callouts: the workaround slice is much smaller; the config and hbck workarounds were merged into one category; Documentation is a new category; the Net/HW/OS slice is bigger!
9. Comparing 2011 to 2012
• The majority of customers upgraded to CDH4.
• More customers, but a similar volume of support incidents.
• Shrunk CDH3’s largest trouble spots significantly.
• Larger number of issues due to underlying systems. This is actually a good thing!
[Pie chart, 6/12-6/13 CDH3+CDH4 HBase support tickets: Net/HW/OS 42%, Workaround (config/hbck) 36%, Patched 14%, Documentation 8%]
15. Upgrade Assistance
• Parcels: simplified distribution, flexibility of install location, side-by-side installs for rolling upgrades
• Rolling upgrades via CM: hot fixes, minor version upgrades
• Automated tests for upgrades and compatibility
16. Configuration / Feature
• Continuous bulk load: avoid it and use Puts instead
• Region tuning: updated defaults + CM
• GC tuning: updated defaults + CM
• Balancer: manual / custom tools
• Bad schema: trial and error
[Pie chart, 6/12-6/13 CDH3+CDH4 HBase support tickets: Net/HW/OS 42%, Workaround (config/hbck) 36%, Bug 14%, Documentation 8%]
17. CM helps
• Sanity checks on configurations
• Wizard based installation and setup
• Wizard based rolling upgrades (minor versions)
• Wizard based backup and disaster recovery strategies
19. Support improvement wishlist
• Improved “ergonomics”
• Better default configurations and guard rails
• “I’m sorry Dave, I can’t let you do that”
• Improved error messaging
• Suggest likely root causes in logs
• Improve the log signal-to-noise ratio
• More ops tooling and frameworks for app development
20. Good news
• All bug fixes go into the Apache versions before CDH
• HBase is maturing:
• Higher percentage of incidents caused by the underlying OS/HW/network
• More performance- and tuning-oriented questions
• Similar percentage of incidents caused by bugs
• We’re getting better:
• Lower percentage of incidents managed with workarounds
• More tools in place to help operational support: hbck, CM, defaults
• We can still do better!
25. Usability Concerns
• Administering HBase has been too hard.
• Difficult to see what is happening inside HBase
• Easy to make bad design decisions early without realizing it
• New developments:
• Metrics revamp
• HTrace
• Frameworks for schema design
27. HTrace
• Problem: where is time being spent inside HBase?
• Solution: the HTrace framework
• Inspired by Google’s Dapper
• Threaded through HBase and HDFS
• Tracks time spent in calls in a distributed system by tracking spans* on different machines
*Some assembly still required.
28. HBase Schemas
• HBase application developers must iterate to find a suitable HBase schema
• Schema is critical for performance at scale
• How can we make this easier? How can we reduce the expertise required to do it?
• Today:
• Lots of tuning knobs
• Developers need to understand column families, rowkey design, data encoding, …
• Some choices are expensive to change after the fact
29. Row key design techniques
• Numeric keys and lexicographic sort
• Store numbers big-endian.
• Pad ASCII numbers with 0’s.
• Use reversal to put the most significant traits first.
• Reverse URLs.
• Reverse timestamps to get the most recent first: (MAX_LONG - ts) so “time” gets monotonically smaller.
• Use composite keys so keys distribute nicely and work well with sub-scans
• Ex: User-ReverseTimeStamp
• Do not use the current timestamp as the first part of a row key!
Examples:
• Unpadded sort order: Row100, Row3, Row31 vs. padded: Row003, Row031, Row100
• URLs: blog.cloudera.com, hbase.apache.org, strataconf.com vs. reversed: com.cloudera.blog, com.strataconf, org.apache.hbase
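The padding and reversal techniques above can be sketched in plain Java. This is an illustrative sketch only; the class and method names below are made up for the example, not an HBase API:

```java
import java.util.Arrays;

// Illustrative row-key helpers: zero-padded numeric keys, reversed
// timestamps, and composite user + reverse-timestamp keys.
public class RowKeys {
    // Pad an ASCII number with leading zeros so lexicographic order
    // matches numeric order.
    static String padded(long n, int width) {
        return String.format("Row%0" + width + "d", n);
    }

    // (MAX_LONG - ts), zero-padded to a fixed width, makes newer
    // timestamps sort first lexicographically.
    static String reverseTs(long ts) {
        return String.format("%019d", Long.MAX_VALUE - ts);
    }

    // Composite key: user id first (groups sub-scans by user),
    // reverse timestamp second (newest entries first within a user).
    static String compositeKey(String user, long ts) {
        return user + "-" + reverseTs(ts);
    }

    public static void main(String[] args) {
        // Unpadded keys sort badly: "Row100" comes before "Row3".
        String[] unpadded = {"Row100", "Row3", "Row31"};
        Arrays.sort(unpadded);
        System.out.println(Arrays.toString(unpadded)); // [Row100, Row3, Row31]

        // Padded keys sort in numeric order.
        String[] paddedKeys = {padded(100, 3), padded(3, 3), padded(31, 3)};
        Arrays.sort(paddedKeys);
        System.out.println(Arrays.toString(paddedKeys)); // [Row003, Row031, Row100]

        // With a reverse timestamp, newer events sort before older ones.
        String newer = compositeKey("alice", 2000L);
        String older = compositeKey("alice", 1000L);
        System.out.println(newer.compareTo(older) < 0); // true
    }
}
```

The same trick is why the slides warn against a plain current timestamp as the key prefix: monotonically increasing prefixes hotspot one region, while a user-first composite key spreads writes and still keeps per-user scans cheap.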
32. Reliable / Highly Available
• Reliable: ability to recover service if a component fails, without losing data.
• Highly Available: ability to quickly recover service if a component fails, without losing data.
• Goal: minimize downtime!
33. Mean Time To Recovery (MTTR)
• Average time taken to automatically recover from a failure:
• Detection time
• Repair time
• Notification time
• Measure: HTrace (Dapper) infrastructure (0.96+)
[Timeline: Detect, Repair, Notify]
34. Reduce Detection Time
• Proactive notification of HMaster failure (0.95)
• Proactive notification of RS failure (0.95)
• Fast server failover (hardware)
[Timeline: Detect, Repair, Notify]
41. Reliable / Highly Available
• Reliable: ability to recover service if a component fails, without losing data.
• Highly Available: ability to quickly recover service if a component fails, without losing data.
• Goal: minimize downtime!
42. Reliable / Highly Available / Latency Tolerant
• Reliable: ability to recover service if a component fails, without losing data.
• Highly Available: ability to quickly recover service if a component fails, without losing data.
• Latency Tolerant: ability to perform and recover in a predictable amount of time, without losing data.
• New goal: predictable performance
43. Common causes of performance variability
• Compaction
• Garbage Collection
• Locality Loss
44. Compaction
• Compactions optimize read layout by rewriting files
• Reduce the seeks required to read a row
• Improve random read performance
• Age off expired or deleted data
• Assumes a uniformly distributed write workload
• But we have new workloads:
• Continuous bulk load write pattern
• Time-series write pattern
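A common mitigation for bulk-load-heavy clusters in this era was to rein in automatic compactions via hbase-site.xml. The property names below are real 0.92/0.94-era settings, but the values are illustrative starting points, not universal recommendations:

```xml
<!-- Sketch of compaction tuning in hbase-site.xml; tune per workload. -->
<configuration>
  <!-- Disable time-based major compactions and run them manually
       off-peak instead, avoiding major compaction storms after bulk loads. -->
  <property>
    <name>hbase.hregion.majorcompaction</name>
    <value>0</value>
  </property>
  <!-- Minimum number of store files before a minor compaction is considered. -->
  <property>
    <name>hbase.hstore.compactionThreshold</name>
    <value>3</value>
  </property>
  <!-- Upper bound on the number of files rewritten in one minor compaction. -->
  <property>
    <name>hbase.hstore.compaction.max</name>
    <value>10</value>
  </property>
</configuration>
```

Setting hbase.hregion.majorcompaction to 0 and scheduling major compactions yourself is a widely used pattern for write-heavy and bulk-load workloads, since it keeps the expensive full rewrites out of peak hours.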
45. Compactions: Put workload
• Minor compactions optimize a subset of adjacent files
• Major compactions optimize all files
• Choosing: assume older files should be larger than newer files.
• If “new” files are “larger” than “older” files: major compaction
• Else, look at the newer files and select files for a minor compaction
[Diagram: newly flushed HFiles grouped into minor compactions, with an occasional major compaction]
46. Compactions: Bulkload workload
• Functionality for loading data en masse; intended for bootstrapping HBase tables
• New write workload: frequently ingest data only via bulk load
• Problem:
• Breaks the age/size assumption!
• Major compaction storms!
• Compactions unnecessarily rewrite large files.
[Diagram: newly bulk loaded and newly flushed HFiles each triggering major compactions]
47. Bulkload: Exploring Compactor
• Explore all compaction possibilities
• Choose the minor compaction that reduces the number of files while incurring the least IO
• “The best bang for the buck”
• The compaction workload is more manageable
[Diagram: exploring policy selecting minor compactions among newly bulk loaded and newly flushed HFiles]
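The selection idea above can be sketched as a toy model: enumerate contiguous windows of store files and pick the one that compacts the most files for the least IO, skipping any window where one file dwarfs the rest. This is a deliberately simplified illustration, not the actual HBase ExploringCompactionPolicy; the ratio parameter only mirrors the spirit of hbase.hstore.compaction.ratio:

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of an "exploring" compaction selector over file sizes.
public class ExploringCompactor {
    // Among windows of at least minFiles adjacent files where no file is
    // oversized relative to the rest (size <= ratio * sum of the others),
    // prefer more files, then the smaller total rewrite (less IO).
    static List<Long> choose(List<Long> sizes, int minFiles, double ratio) {
        List<Long> best = new ArrayList<>();
        long bestIo = Long.MAX_VALUE;
        for (int start = 0; start < sizes.size(); start++) {
            for (int end = start + minFiles; end <= sizes.size(); end++) {
                List<Long> window = sizes.subList(start, end);
                long total = window.stream().mapToLong(Long::longValue).sum();
                // Skip windows with a file too big relative to the rest:
                // rewriting it is wasted IO (the bulk-load problem).
                boolean ok = window.stream().allMatch(s -> s <= ratio * (total - s));
                if (!ok) continue;
                if (window.size() > best.size()
                        || (window.size() == best.size() && total < bestIo)) {
                    best = new ArrayList<>(window);
                    bestIo = total;
                }
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One huge bulk-loaded file plus small flushed files: the huge file
        // is left out, avoiding a needless major rewrite.
        List<Long> sizes = List.of(1000L, 10L, 12L, 11L, 9L);
        System.out.println(choose(sizes, 3, 1.2)); // [10, 12, 11, 9]
    }
}
```

With a Put-only workload of similar-sized files the whole set qualifies, so this behaves like the old policy; it only diverges when bulk loads break the age/size assumption.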
Speaker notes:
HBase is a project that solves this problem. In a sentence, HBase is an open source, distributed, sorted map modeled after Google’s BigTable. Open source: Apache HBase is an open source project with an Apache 2.0 license. Distributed: HBase is designed to use multiple machines to store and serve data. Sorted map: HBase stores data as a map, and guarantees that adjacent keys will be stored next to each other on disk. HBase is modeled after BigTable, a system that is used for hundreds of applications at Google.
This pie chart is the product of analyzing critical production HBase tickets over the past 6 months: misconfiguration 44%, patch 12%, HW/NW 16%, repair 28%, meaning that correcting a misconfiguration was all it took to bring HBase back up again. As you can see, misconfigurations and bugs break the most HBase clusters. Fixing bugs is up to the community. Fixing misconfigurations is up to you, and is the focus of the next segment. Because it is hard to diagnose, a misconfiguration is not what you want to spend your time on. If your cluster is broken, it is probably a misconfiguration. This is a hard problem because the error messages are not tightly tied to the root cause.
Hannibal helped a lot with identifying balance issues.