Apachecon Europe 2012: Operating HBase - Things you need to know (Christian Gügi)
If you’re running HBase in production, you have to be aware of many things. In this talk we will share our experience in running and operating an HBase production cluster for a customer. To avoid common pitfalls, we’ll discuss problems and challenges we’ve faced as well as practical solutions (real-world techniques) for repair.
Even though HBase provides internal tools for diagnosing issues and for repair, running a healthy cluster can still be challenging for an administrator. We'll cover some background on these tools as well as on HBase internals such as compaction, region splits and their distribution.
We'll also introduce our tool to visualize region sizing and distribution in the cluster, which we recently open sourced.
Cloudera Impala: A Modern SQL Engine for Hadoop (Cloudera, Inc.)
This is a technical deep dive about Cloudera Impala, the project that makes scalable parallel database technology available to the Hadoop community for the first time. Impala is an open-sourced code base that allows users to issue low-latency queries to data stored in HDFS and Apache HBase using familiar SQL operators.
Presenter Marcel Kornacker, creator of Impala, begins with an overview of Impala from the user's perspective, follows with an overview of Impala's architecture and implementation, and concludes with a comparison of Impala with Dremel, Apache Hive, commercial MapReduce alternatives, and traditional data warehouse infrastructure.
James Kinley from Cloudera:
An introduction to Cloudera Impala. Cloudera Impala provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.
The link to the video: http://zurichtechtalks.ch/post/37339409724/an-introduction-to-cloudera-impala-sql-on-top-of
Marcel Kornacker is a tech lead at Cloudera.
In this talk from Impala architect Marcel Kornacker, you will explore:
- How Impala's architecture supports query speed over Hadoop data that not only convincingly exceeds that of Hive, but also that of a proprietary analytic DBMS over its own native columnar format
- The current state of, and roadmap for, Impala's analytic SQL functionality
- An example configuration and benchmark suite that demonstrate how Impala offers a high level of performance, functionality, and the ability to handle a multi-user workload, while retaining Hadoop's traditional strengths of flexibility and ease of scaling
Impala architecture presentation at the Toronto Hadoop User Group, January 2014, by Mark Grover.
Event details:
http://www.meetup.com/TorontoHUG/events/150328602/
Learn how Cloudera Impala empowers you to:
- Perform interactive, real-time analysis directly on source data stored in Hadoop
- Interact with data in HDFS and HBase at the “speed of thought”
- Reduce data movement between systems & eliminate double storage
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... (HBaseCon)
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance enabling direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in-progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
Impala 2.0 - The Best Analytic Database for Hadoop (Cloudera, Inc.)
A look at why SQL access in Hadoop is critical and the benefits of a native Hadoop analytic database, what’s new with Impala 2.0 and some of the recent performance benchmarks, some common Impala use cases and production customer stories, and insight into what’s next for Impala.
HBase Read High Availability Using Timeline-Consistent Region Replicas (HBaseCon)
Speakers: Enis Soztutar and Devaraj Das (Hortonworks)
HBase has ACID semantics within a row that make it a perfect candidate for a lot of real-time serving workloads. However, single homing a region to a server implies some periods of unavailability for the regions after a server crash. Although the mean time to recovery has improved a lot recently, for some use cases, it is still preferable to do possibly stale reads while the region is recovering. In this talk, you will get an overview of our design and implementation of region replicas in HBase, which provide timeline-consistent reads even when the primary region is unavailable or busy.
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory... (Cloudera, Inc.)
Mignify is a platform for collecting, storing, and analyzing Big Data harvested from the web. It aims at providing easy access to focused and structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage layer based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including the temporal aspect. The whole system is deployed on an innovative hardware architecture comprising a high number of small (low-consumption) nodes. This talk will tackle the decisions made along the design and development of the platform, from both a technical and a functional perspective. It will introduce the cloud infrastructure, the ETL-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism for analytics based on a declarative filter/extraction specification. The design choices will be illustrated with a pilot application targeting Daily Web Monitoring in the context of a national domain.
Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL (Arseny Chernov)
Fast, demo-enabled 60-min lecture, aligned to curriculum of RDBMS / SQL course taught at Singapore University of Technology and Design (SUTD), a collaboration with MIT. More details about this lecture and some photos here: http://bit.ly/sutd-mit-lecture
Operationalizing Data Science Using Cloud Foundry (VMware Tanzu)
SpringOne Platform 2016
Speaker: Lawrence Spracklen; Vice President of Engineering, Alpine Data Labs.
Data science is undoubtedly becoming a key component of every company's core strategy for growth and increased revenue potential. To meet this market demand, the big data industry has exploded with a variety of tools to address various pieces of the data science value chain, from model scoring, to notebook interfaces, to niche algorithmic techniques. However, despite the increase in innovation in this area, many insights generated by data science teams end up "dying on the vine". There has to be a better way of deploying operational models to end users through intuitive interfaces that they can use every day.
In this session, we will demo how the joint solution between Alpine's Chorus Platform and Cloud Foundry addresses this problem and closes the gap between data science insights and business value. We will demo an example of creating a machine learning model leveraging data within MPP databases such as Apache HAWQ or Greenplum Database integrated with the Chorus Platform, and then deploying it as a microservice within Cloud Foundry as a scoring engine. This turn-key solution will show attendees how easy it is to plug analytic insights into end user applications that scale, without going through lengthy development cycles.
Presentations from the Cloudera Impala meetup on Aug 20 2013 (Cloudera, Inc.)
Presentations from the Cloudera Impala meetup on Aug 20 2013:
- Nong Li on Parquet+Impala and UDF support
- Henry Robinson on performance tuning for Impala
A brave new world in mutable big data relational storage (Strata NYC 2017) - Todd Lipcon
The ever-increasing interest in running fast analytic scans on constantly updating data is stretching the capabilities of HDFS and NoSQL storage. Users want the fast online updates and serving of real-time data that NoSQL offers, as well as the fast scans, analytics, and processing of HDFS. Additionally, users are demanding that big data storage systems integrate natively with their existing BI and analytic technology investments, which typically use SQL as the standard query language of choice. This demand has led big data back to a familiar friend: relationally structured data storage systems.
Todd Lipcon explores the advantages of relational storage and reviews new developments, including Google Cloud Spanner and Apache Kudu, which provide a scalable relational solution for users who have too much data for a legacy high-performance analytic system. Todd explains how to address use cases that fall between HDFS and NoSQL with technologies like Apache Kudu or Google Cloud Spanner and how the combination of relational data models, SQL query support, and native API-based access enables the next generation of big data applications. Along the way, he also covers suggested architectures, the performance characteristics of Kudu and Spanner, and the deployment flexibility each option provides.
January 2015 HUG: Apache Flink: Fast and reliable large-scale data processing (Yahoo Developer Network)
Apache Flink (incubating) is one of the latest additions to the Apache family of data processing engines. In short, Flink's design aims to be as fast as in-memory engines, while providing the reliability of Hadoop. Flink contains (1) APIs in Java and Scala for both batch-processing and data streaming applications, (2) a translation stack for transforming these programs to parallel data flows, and (3) a runtime that supports both proper streaming and batch processing for executing these data flows in large compute clusters.
Flink's batch APIs build on functional primitives (map, reduce, join, cogroup, etc.), and augment those with dedicated operators for iterative algorithms and support for logical, SQL-like key attribute referencing (e.g., groupBy("WordCount.word")). The Flink streaming API extends the primitives from the batch API with flexible window semantics.
Internally, Flink transforms the user programs into distributed data stream programs. In the course of the transformation, Flink analyzes functions and data types (using Scala macros and reflection), and picks physical execution strategies using a cost-based optimizer. Flink’s runtime is a true streaming engine, supporting both batching and streaming. Flink operates on a serialized data representation with memory-adaptive out-of-core algorithms for sorting and hashing. This makes Flink match the performance of in-memory engines on memory-resident datasets, while scaling robustly to larger disk-resident datasets.
Finally, Flink is compatible with the Hadoop ecosystem. Flink runs on YARN, reads data from HDFS and HBase, and supports mixing existing Hadoop Map and Reduce functions into Flink programs. Ongoing work is adding Apache Tez as an additional runtime backend.
This talk presents Flink from a user perspective. We introduce the APIs and highlight the most interesting design points behind Flink, discussing how they contribute to the goals of performance, robustness, and flexibility. We finally give an outlook on Flink’s development roadmap.
An introduction to Cloudera Impala that shows how Impala works and how it processes queries internally, covering architecture, frontend, query compilation, backend, code generation, HDFS-related details, and a performance comparison.
This talk uses a case study to demonstrate core data science capabilities in Big Data, infrastructure requirements, and talent profiles that translate to early success. Using the challenge of classifying events in a consumer-oriented website, the discussion is for a wide audience:
- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data sciences
- Hiring managers will expand their knowledge of the skills required to bring business value with data
Apache Spark has emerged over the past year as the imminent successor to Hadoop MapReduce. Spark can process data in memory at very high speed, while still being able to spill to disk if required. Spark's powerful yet flexible API allows users to write complex applications very easily without worrying about the internal workings and how the data gets processed on the cluster.
Spark comes with an extremely powerful Streaming API to process data as it is ingested. Spark Streaming integrates with popular data ingest systems like Apache Flume, Apache Kafka, and Amazon Kinesis, allowing users to process data as it comes in.
In this talk, Hari will discuss the basics of Spark Streaming, its API, and its integration with Flume, Kafka, and Kinesis. Hari will also discuss a real-world example of a Spark Streaming application, and how code can be shared between a Spark application and a Spark Streaming application. Each stage of the application's execution will be presented, which can help in understanding good practices for writing such an application. Hari will finally discuss how to write a custom application and a custom receiver to receive data from other systems.
R+Hadoop - Ask Bigger (and New) Questions and Get Better, Faster Answers (Revolution Analytics)
The business cases for Hadoop can be made on the tremendous operational cost savings that it affords. But why stop there? The integration of R-powered analytics in Hadoop presents a totally new value proposition. Organizations can write R code and deploy it natively in Hadoop without data movement or the need to write their own MapReduce. Bringing R-powered predictive analytics into Hadoop will accelerate Hadoop’s value to organizations by allowing them to break through performance and scalability challenges and solve new analytic problems. Use all the data in Hadoop to discover more, grow more quickly, and operate more efficiently. Ask bigger questions. Ask new questions. Get better, faster results and share them.
From the Predictive Analytics Innovation Summit
Video here: https://www.youtube.com/watch?v=PdKUt0zK0UY
With the avalanche of data about operations, customers, and products, leading companies are utilizing Big Analytics to better understand historical patterns and predict what may come next to create sustained competitive advantage. Dan Mallinger, who leads Think Big Analytic's data science team, will focus on practical examples of where companies are implementing new analytics approaches over big data. Dan will discuss how these efforts differ from traditional analytic approaches, the organizational and business impact, and how our clients are creating new value in areas such as marketing, services, sales and product development.
Overview of JD Long's experimental "segue" package which marshals and manages Hadoop clusters (for non-Big Data problems) with Amazon's Elastic MapReduce service.
Presented at the February 2011 meeting of the Greater Boston useR Group.
Social Networks and the Richness of Data (larsgeorge)
Social networks by their nature deal with large amounts of user-generated data that must be processed and presented in a time sensitive manner. Much more write intensive than previous generations of websites, social networks have been on the leading edge of non-relational persistence technology adoption. This talk presents how Germany's leading social networks Schuelervz, Studivz and Meinvz are incorporating Redis and Project Voldemort into their platform to run features like activity streams.
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa (larsgeorge)
Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012 (larsgeorge)
In the early days of web applications, sites were designed to serve users and gather information along the way. With the proliferation of data sources and growing user bases, the amount of data generated required new ways for storage and processing. Hadoop's HDFS and its batch oriented MapReduce opened new possibilities, yet it falls short of instant delivery of aggregate data to end users. Adding HBase and other layers, such as stream processing using Twitter's Storm, can overcome this delay and bridge the gap to realtime aggregation and reporting. This presentation takes the audience from the beginning of web application design to the current architecture, which combines multiple technologies to be able to process vast amounts of data, while still being able to react timely and report near realtime statistics.
http://berlinbuzzwords.de/sessions/batch-realtime-hadoop
HBase Applications - Atlanta HUG - May 2014 (larsgeorge)
HBase is good at various workloads, ranging from sequential range scans to purely random access. These access patterns can be translated into application types, usually falling into two major groups: entities and events. This presentation discusses the underlying implications and how to approach those use-cases. Examples taken from Facebook show how this has been tackled in real life.
These are my slides for the 5 minute overview talk I gave during a recent workshop at the European Commission in Brussels, on the topic of "Big Data Skills in Europe".
Have a lot of data? Using or considering using Apache HBase (part of the Hadoop family) to store your data? Want to have your cake and eat it too? Phoenix is an open source project put out by Salesforce. Join us to learn how you can continue to use SQL, but get the raw speed of native HBase usage through Phoenix.
With the public confession of Facebook, HBase is on everyone's lips when it comes to the discussion around the new "NoSQL" area of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase. This includes the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging, and so on. This talk is based on the research for an upcoming second edition of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
From: DataWorks Summit 2017 - Munich - 20170406
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging, and so on. This talk is based on the research for an upcoming second edition of the speaker's HBase book, correlated with practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Apache HBase™ is the Hadoop database, a distributed, scalable, big data store. It's a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to those large data sets. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
Learning Objectives - In this module, you will understand advanced Hive concepts such as UDFs, get to know the columnar database HBase, and compare the SQL and NoSQL approaches.
The workshop covers the HBase data model, architecture, and schema design principles.
Source code demo:
https://github.com/moisieienko-valerii/hbase-workshop
This presentation will give you information about:
1. HBase Overview and Architecture
2. HBase Installation
3. HBase Shell
4. CRUD Operations
5. Scanning and Batching
6. HBase Filters
7. HBase Key Design
From: DataWorks Summit Munich 2017 - 20170406
While you could be tempted to assume data is already safe in a single Hadoop cluster, in practice you have to plan for more. Questions like "What happens if the entire datacenter fails?" or "How do I recover into a consistent state of data, so that applications can continue to run?" are not at all trivial to answer for Hadoop. Did you know that HDFS snapshots do not handle open files as immutable? Or that HBase snapshots are executed asynchronously across servers and therefore cannot guarantee atomicity for cross-region updates (which includes tables)? There is no unified and coherent data backup strategy, nor is there tooling available for many of the included components to build such a strategy. The Hadoop distributions largely avoid this topic, as most customers are still in the "single use-case" or PoC phase, where data governance as far as backup and disaster recovery (BDR) is concerned is not (yet) important. This talk first introduces you to the overarching issue and difficulties of backup and data safety, looking at each of the many components in Hadoop, including HDFS, HBase, YARN, Oozie, the management components and so on, to finally show you a viable approach using built-in tools. You will also learn not to take this topic lightheartedly and what is needed to implement and guarantee the continuous operation of Hadoop cluster based solutions.
Data Pipelines in Hadoop - SAP Meetup in Tel Aviv (larsgeorge)
This talk shows the complexity of building a data pipeline in Hadoop, starting with the technology aspect and then correlating it to the skill sets of current Hadoop adopters.
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 (larsgeorge)
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real-world use-cases and how they were addressed.
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series, part 4. In this session, we will cover an overview of Test Manager along with the SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview (Prayukth K V)
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
6. What is HBase? This is HBase! Really though… RTFM! (There are at least two good books about it.)
7. IOPS vs. Throughput Mythbusters: It is all physics in the end; you cannot solve an I/O problem without reducing I/O in general. Parallelize access and read/write sequentially.
8. HBase: Strengths & Weaknesses
Strengths:
- Random access to small(ish) key-value pairs
- Rows and columns stored sorted lexicographically
- Adds table and region concepts to group related KVs
- Stores and reads data sequentially
- Parallelizes across all clients
- Non-blocking I/O throughout
10. HBase "Indexes"
- Use the primary keys, aka the row keys, as a sorted index
  - One sort direction only
  - Use a "secondary index" to get reverse sorting (sketched below)
    - Lookup table or same table
- Use the secondary keys, aka the column qualifiers, as a sorted index within the main record
  - Use prefixes within a column family or separate column families
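A hedged sketch of the lookup-table variant (all table and column names here are hypothetical, and the SQL-to-HBase mapping used is the one introduced later in this deck): storing an inverted timestamp in the index row key makes a plain forward scan return the newest entries first.

-- Populate a "secondary index" table whose key sorts newest-first,
-- by subtracting the timestamp from a large constant:
INSERT INTO message_index
SELECT concat(user_id, '_',
              cast(9999999999 - unix_timestamp(created) AS string)) AS key,
       msg_key
FROM messages;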
11. HBase: Strengths & Weaknesses
Weaknesses:
- Not optimized (yet) for 100% of the possible throughput of the underlying storage layer
  - And HDFS is not fully optimized either
- Single-writer issue with WALs
- Single-server hotspotting with non-distributed keys
12. HBase Dilemma: Although HBase can host many applications, they may require completely opposite features: Events (Time Series) vs. Entities (Message Store).
13. Opposite Use-Cases
Entity Store:
- Regular (random) updates and inserts in existing entities
- Causes entity details being spread over many files
- Needs to read a lot of data to reconstitute the "logical" view
- Writing is often nicely distributed (can be hashed)
Event Store:
- One-off inserts of events such as log entries
- Access is often a scan over partitions by time
- Reads are efficient due to the sequential write pattern
- Writes need to be taken care of to avoid hotspotting (see the sketch below)
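One common way to take care of those event writes, sketched here as an assumption-laden example (the events table, its columns, and the bucket width are all illustrative; fnv_hash() is an Impala built-in):

-- Prefix the row key with a small hash bucket so that monotonically
-- increasing timestamps spread across regions instead of hotspotting one:
INSERT INTO events
SELECT concat(substr(hex(fnv_hash(source_id)), 1, 2),  -- 2-char bucket prefix
              '_', source_id,
              '_', cast(event_ts AS string)) AS key,
       payload
FROM staged_events;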
15. Beyond Batch
For some things MapReduce is just too slow.
Apache Hive:
- MapReduce execution engine
- High latency, low throughput
- High runtime overhead
Google realized this early on:
- Analysts wanted fast, interactive results
16. Dremel
Google paper (2010): a "scalable, interactive ad-hoc query system for analysis of read-only nested data". Columnar storage format and distributed, scalable aggregation, "capable of running aggregation queries over trillion-row tables in seconds".
http://research.google.com/pubs/pub36632.html
17. Impala: Goals
General-purpose SQL query engine for Hadoop:
- For analytical and transactional workloads
- Support queries that take ms to hours
- Run directly with Hadoop
  - Collocated daemons
  - Same file formats
  - Same storage managers (NN, metastore)
18. Impala: Goals
- High performance
  - C++ runtime
  - Code generation (LLVM)
  - Direct access to data (no MapReduce)
- Retain user experience
  - Easy for Hive users to migrate
  - 100% open source
19. Impala: Architecture
- impalad
  - Runs on every node
  - Handles client requests (ODBC, Thrift)
  - Handles query planning & execution
- statestored
  - Provides the name service
  - Metadata distribution
  - Used for finding data
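Because an impalad runs on every node and accepts client requests, a shell session can be pointed at any node in the cluster. A minimal sketch (the host name is illustrative; 21000 is the default impala-shell port):

impala-shell -i node01.example.com:21000 -q "SHOW TABLES;"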
25. Binary to Types
HBase only has binary keys and values:
- Hive and Impala share the same metastore, which adds types to each column
  - Can use the Hive or Impala shell to change the metadata
- The row key of an HBase table is mapped to a column in the metastore, i.e. on the SQL side
  - Impala prefers the "string" type to better support comparisons and sorting
26. Defining the Schema

CREATE TABLE hbase_table_1(
key string, value string
)
STORED BY
"org.apache.hadoop.hive.hbase.HBaseStorageHandler"
WITH SERDEPROPERTIES(
"hbase.columns.mapping" = ":key,cf1:val"
)
TBLPROPERTIES (
"hbase.table.name" = "xyz"
);
27. Defining the Schema

CREATE TABLE hbase_table_1(
key string, value string
)
STORED BY
"org.apache.hadoop.hive.hbase.HBaseStorageHandler"
WITH SERDEPROPERTIES(
"hbase.columns.mapping" = ":key,cf1:val"  -- maps columns to fields
)
TBLPROPERTIES (
"hbase.table.name" = "xyz"
);
28. Mapping Options
- Can create a new table or map to an existing one
  - CREATE TABLE vs. CREATE EXTERNAL TABLE
- Creating the table through Hive or Impala does not set any table or column family properties
  - Typically not a good idea to rely on the defaults
  - Better to specify compression, TTLs, etc. on the HBase side and then map as an external table (see the sketch below)
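A sketch of that external-table variant, assuming the HBase table "xyz" was already created (with compression, TTLs, etc.) on the HBase side; apart from the EXTERNAL keyword it matches the earlier schema example:

CREATE EXTERNAL TABLE hbase_table_2(
key string, value string
)
STORED BY
"org.apache.hadoop.hive.hbase.HBaseStorageHandler"
WITH SERDEPROPERTIES(
"hbase.columns.mapping" = ":key,cf1:val"
)
TBLPROPERTIES (
"hbase.table.name" = "xyz"
);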
29. Mapping Options
SERDE properties to map columns to fields:
- hbase.columns.mapping
  - Matching count of entries required (on the SQL side only)
  - Spaces are not allowed (as they are valid characters in HBase)
  - The ":key" mapping is a special one for the HBase row key
  - Otherwise: column-family-name:[column-name][#(binary|string)] (example below)
- hbase.table.default.storage.type
  - Can be string (the default) or binary
  - Defines the default type
  - Binary means the data is treated like the HBase Bytes class does
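For instance, a mapping that adds a numeric column stored in its binary representation could look like this sketch (cf1:count is an illustrative column):

WITH SERDEPROPERTIES(
"hbase.columns.mapping" = ":key,cf1:val,cf1:count#binary"
)

With the #binary suffix the bytes are interpreted the way the HBase Bytes class writes them; entries without a suffix fall back to hbase.table.default.storage.type.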
30. Mapping Limits
- Only one (1) ":key" is allowed
  - But it can be inserted into the SQL schema at will
- Access to HBase KV versions is not supported (yet)
  - Always returns the latest version by default
  - This is very similar to what a database user expects
- HBase columns that are not mapped are not visible on the SQL side
- Since row keys in HBase are unique, results may vary
  - Inserting duplicate keys updates the row while the count of rows stays the same (sketched below)
  - INSERT OVERWRITE does not delete existing rows but rather updates them (HBase is mutable after all!)
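A short sketch of that upsert behavior against the mapped table from the schema example (values illustrative):

INSERT INTO hbase_table_1 VALUES ("user1234", "v1");
INSERT INTO hbase_table_1 VALUES ("user1234", "v2");
-- The second statement updates the existing HBase row in place:
SELECT count(*) FROM hbase_table_1;                      -- still 1
SELECT value FROM hbase_table_1 WHERE key = "user1234";  -- returns "v2"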
35. HBase Scans under the Hood
Impala uses Scan instances under the hood, just as the native Java API does. This allows for all scan optimizations, e.g. predicate push-down, like:
- Start and stop row
- Server-side filters
- Scanner caching (but not batching yet)
36. Configure HBase Scan Details
In impala-shell:
- Same as calling setCacheBlocks(true) or setCacheBlocks(false):
set hbase_cache_blocks=true;
set hbase_cache_blocks=false;
- Same as calling setCaching(rows):
set hbase_caching=1000;
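Put together, a session might raise the scanner caching and disable block caching before running a large selective scan; a sketch with illustrative values:

set hbase_caching=1000;
set hbase_cache_blocks=false;
SELECT f1, f2, f3 FROM mapped_table
WHERE key >= "user1234" AND key < "user1235";

Higher caching reduces the number of RPCs per scan, while disabling block caching avoids evicting hot data from the region servers' cache during one-off scans.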
37. HBase Scans under the Hood
Back to physics: a scan can only perform well if as little data as possible is read.
- Need to issue queries that are known not to be full table scans
- This requires careful schema design!
Typical use-cases are:
- OLAP cube: read report data from a single row
- Time series: read fine-grained, time-partitioned data
38. OLAP Example
Facebook Insights is using HBase to keep an OLAP cube live, i.e. fully materialized:
- Each row reflects one tracked page and contains all its data points
  - All dimensions with a time-bracket prefix, plus TTLs
- At report time only one or very few rows are read
- The design favors read over write performance
- Could also think about a hybrid system:
  - CEP + HBase + HDFS (Parquet)
39. Time Series Example
- OpenTSDB writes the metric events bucketed by metric ID and then timestamp
  - Helps using all servers in the cluster equally
- During reporting/dashboarding the data is read for specific metrics within a specific time frame
  - Sorted data translates into effective use of Scan with start and stop rows (see the example below)
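Translated to the SQL layer, such a dashboard read becomes a bounded key-range query; a sketch assuming string row keys of the form <metric>_<date> (a hypothetical layout):

SELECT key, value FROM metrics_table
WHERE key >= "cpu.load_20140101" AND key < "cpu.load_20140201";

Because the predicate bounds the row key, the scan receives start and stop rows instead of touching the entire table.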
40. Final Notes
Since HBase scan performance is mainly influenced by the number of rows scanned, you need to issue queries that are selective, i.e. scan only certain rows and not the entire table. This requires WHERE clauses with the HBase row key in them:

SELECT f1, f2, f3 FROM mapped_table
WHERE key >= "user1234" AND key < "user1235";

"Scan all rows for user 1234, i.e. that have a row key starting with user1234" - it might be a composite key!
42. Final Notes
Not using the primary HBase index, aka the row key, results in a full table scan and might take much longer when you have a large table:

SELECT f1, f2, f3 FROM mapped_table
WHERE f1 = "value1" OR f20 < "200";

This will result in a full table scan. Remember: it is all just physics!
43. Final Notes
Impala also uses the SingleColumnValueFilter from HBase to reduce the transferred data:
- Filters out entire rows by checking a given column value
- Does not skip rows, since no index or Bloom filter is available to help identify the next match
Overall this helps, yet it cannot do any magic (physics again!)
44. Final Notes
Some advice on tall-narrow vs. flat-wide table layout: store data in a tall and narrow table, since there is currently no support for scanner batching (i.e. intra-row scanning). Mapping, for example, one million HBase columns into SQL is futile. This is still true for Hive's Map support, since the entire row has to fit into memory!
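In the mapping terms used earlier, a tall-narrow layout might look like this sketch (names illustrative): one value per HBase row, keyed by entity, metric, and timestamp, instead of one row holding a million columns:

CREATE TABLE metrics_narrow(
key string, value string
)
STORED BY
"org.apache.hadoop.hive.hbase.HBaseStorageHandler"
WITH SERDEPROPERTIES(
"hbase.columns.mapping" = ":key,cf1:val"
);

Here key would hold something like "<entity>_<metric>_<timestamp>".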
45. Outlook
Future work:
- Composite keys: map multiple SQL fields into a single composite HBase row key
- Expose KV versions to the SQL schema
- Better predicate pushdown
  - Advanced filters or indexes?