This document provides an overview and agenda for an Apache HBase workshop. It introduces HBase as an open-source NoSQL database built on Hadoop that uses a column-family data model. The agenda covers what HBase is, its data model including rows, columns, cells and versions, CRUD operations, architecture including regions and masters, schema design best practices, and the Java API. Performance tips are given for client reads and writes such as using batches, caching, and tuning durability.
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process. We’ll learn about AWS Database Migration Service and AWS Schema Migration Tool, which were recently enhanced to import data from six common data warehouse platforms.
Apachecon Europe 2012: Operating HBase - Things you need to knowChristian Gügi
If you’re running HBase in production, you have to be aware of many things. In this talk we will share our experience in running and operating an HBase production cluster for a customer. To avoid common pitfalls, we’ll discuss problems and challenges we’ve faced as well as practical solutions (real-world techniques) for repair.
Even though HBase provides internal tools for diagnosing issues and for repair, running a healthy cluster can still be challenging for an administrator. We'll cover some background on these tools as well as on HBase internals such as compaction, region splits and their distribution.
We'll also introduce our tool to visualize region sizing and distribution in the cluster, that we recently open sourced.
For our eReader development project, we had to find a persistent storage for our JSON documents. After initial scanning we zeroed into two products DynamoDB and MongoDB. These slides take a deeper dive in the selection of our JSON data store.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and use work load management.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
Who Should Attend:
• Data Warehouse Developers, Big Data Architects, BI Managers, and Data Engineers
Learn best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your data warehouse performance.
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process. We’ll learn about AWS Database Migration Service and AWS Schema Migration Tool, which were recently enhanced to import data from six common data warehouse platforms.
Apachecon Europe 2012: Operating HBase - Things you need to knowChristian Gügi
If you’re running HBase in production, you have to be aware of many things. In this talk we will share our experience in running and operating an HBase production cluster for a customer. To avoid common pitfalls, we’ll discuss problems and challenges we’ve faced as well as practical solutions (real-world techniques) for repair.
Even though HBase provides internal tools for diagnosing issues and for repair, running a healthy cluster can still be challenging for an administrator. We'll cover some background on these tools as well as on HBase internals such as compaction, region splits and their distribution.
We'll also introduce our tool to visualize region sizing and distribution in the cluster, that we recently open sourced.
For our eReader development project, we had to find a persistent storage for our JSON documents. After initial scanning we zeroed into two products DynamoDB and MongoDB. These slides take a deeper dive in the selection of our JSON data store.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and use work load management.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
Who Should Attend:
• Data Warehouse Developers, Big Data Architects, BI Managers, and Data Engineers
Learn best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your data warehouse performance.
Real-Time Data Exploration and Analytics with Amazon Elasticsearch ServiceAmazon Web Services
Elasticsearch is a fully featured search engine used for real-time analytics, and Amazon Elasticsearch Service makes it easy to deploy Elasticsearch clusters on AWS. With Amazon ES, you can ingest and process billions of events per day, and explore the data using Kibana to discover patterns. In this session, we use Apache web logs as example and show you how to build an end-to-end analytics solution.
End-to-end Data Governance with Apache Avro and AtlasDataWorks Summit
Aeolus is Comcast’s new internal Big Data system for providing access to an integrated view of a wide variety of high-quality, near-real-time and batch data. Such integration can enable data scientists to uncover otherwise hidden trends, anomalies, and powerful predictors of business successes and failures. But integrating data across silos in a large enterprise is fraught with peril. There typically are few standards on naming conventions and data representation, and spotty documentation at best. The old rule of thumb often applies: 70% of the analysts’ time goes into data wrangling, while only 30% goes toward the actual analyses and simulations. The goal of the Athene Data Governance Platform within Aeolus is to invert this ratio. This talk will explain how Comcast is using Apache Avro and Atlas for end-to-end data governance, the challenges faced, and methods used to address these challenges.
Avro provides a lingua franca for data representation, data integration, and schema evolution. All data published for community consumption must have an associated avro schema in Atlas. Every step in its journey through Aeolus, in flight or at rest, is captured in Atlas. Atlas’ extensibility has allowed us to add or update various entity types (e.g., avro schemas, kafka topics, object store pseudo-directories) and lineage types (e.g., storing streaming data in object storage; embellishing and re-publishing streaming data; performing aggregations and other transformations on data at rest; and evolution of schemas with compatibility flags). Transformation services notify Atlas of lineage links via custom asynchronous kafka messaging.
Atlas provides self-service data discovery and lineage browsing and querying, via full-text search, DSL query language, or gremlin graph query language. Example queries: “Where is data from kafka topic X stored?” “Display the journey of data currently stored in pseudo-directory X since it entered the Aeolus system”. “Show me all earlier versions of schema S, and whether they are forward/backward compatible with each other.”
Take an in-depth look at data warehousing with Amazon Redshift and get answers to your technical questions. We will cover performance tuning techniques that take advantage of Amazon Redshift's columnar technology and massively parallel processing architecture. We will also discuss best practices for migrating from existing data warehouses, optimizing your schema, loading data efficiently, and using work load management and interleaved sorting.
Moderated by Lars Hofhansl (Salesforce), with Matteo Bertozzi (Cloudera), John Leach (Splice Machine), Maxim Lukiyanov (Microsoft), Matt Mullins (Facebook), and Carter Page (Google)
The future of HBase, via a variety of viewpoints.
HBase hast established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, overing advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the an upcoming second release of the speakers HBase book, correlated with the practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
In this performance-oriented session, we will cover tuning techniques that take advantage of Amazon Redshift's columnar technology and massively parallel processing architecture. We will also discuss best practices for migrating from existing data warehouses, optimizing your schema, loading data efficiently, and using work load management and interleaved sorting.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this webinar, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to design schemas and load data efficiently
• Learn best practices for workload management, distribution and sort keys, and optimizing queries
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use work load management.
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...Cloudera, Inc.
Mignify is a platform for collecting, storing and analyzing Big Data harvested from the web. It aims at providing an easy access to focused and structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including the temporal aspect. The whole system is deployed in an innovative hardware architecture comprising of a high number of small (low-consumption) nodes. This talk will tackle the decisions made along the design and development of the platform, both under a technical and functional perspective. It will introduce the cloud infrastructure, the LTE-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism of analytics based on a declarative filter/extraction specification. The design choices will be illustrated with a pilot application targeting Daily Web Monitoring in the context of a national domain.
(BDT401) Amazon Redshift Deep Dive: Tuning and Best PracticesAmazon Web Services
Get a look under the covers: Learn tuning best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your delivery of queries and improve overall database performance. This session explains how to migrate from existing data warehouses, create an optimized schema, efficiently load data, use work load management, tune your queries, and use Amazon Redshift's interleaved sorting features. Finally, learn how TripAdvisor uses these best practices to give their entire organization access to analytic insights at scale.
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance enabling direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in-progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar SeriesAmazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and tune query and database performance.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
"No matter the industry, leading organizations need to closely integrate, deploy, secure, and scale diverse technologies to support workloads while containing costs. Nasdaq, Inc.—a leading provider of trading, clearing, and exchange technology—is no exception.
After migrating more than 1,100 tables from a legacy data warehouse into Amazon Redshift, Nasdaq, Inc. is now implementing a fully-integrated, big data architecture that also includes Amazon S3, Amazon EMR, and Presto to securely analyze large historical data sets in a highly regulated environment. Drawing from this experience, Nasdaq, Inc. shares lessons learned and best practices for deploying a highly secure, unified, big data architecture on AWS.
Attendees learn:
Architectural recommendations to extend an Amazon Redshift data warehouse with Amazon EMR and Presto.
Tips to migrate historical data from an on-premises solution and Amazon Redshift to Amazon S3, making it consumable.
Best practices for securing critical data and applications leveraging encryption, SELinux, and VPC."
Real-Time Data Exploration and Analytics with Amazon Elasticsearch ServiceAmazon Web Services
Elasticsearch is a fully featured search engine used for real-time analytics, and Amazon Elasticsearch Service makes it easy to deploy Elasticsearch clusters on AWS. With Amazon ES, you can ingest and process billions of events per day, and explore the data using Kibana to discover patterns. In this session, we use Apache web logs as example and show you how to build an end-to-end analytics solution.
End-to-end Data Governance with Apache Avro and AtlasDataWorks Summit
Aeolus is Comcast’s new internal Big Data system for providing access to an integrated view of a wide variety of high-quality, near-real-time and batch data. Such integration can enable data scientists to uncover otherwise hidden trends, anomalies, and powerful predictors of business successes and failures. But integrating data across silos in a large enterprise is fraught with peril. There typically are few standards on naming conventions and data representation, and spotty documentation at best. The old rule of thumb often applies: 70% of the analysts’ time goes into data wrangling, while only 30% goes toward the actual analyses and simulations. The goal of the Athene Data Governance Platform within Aeolus is to invert this ratio. This talk will explain how Comcast is using Apache Avro and Atlas for end-to-end data governance, the challenges faced, and methods used to address these challenges.
Avro provides a lingua franca for data representation, data integration, and schema evolution. All data published for community consumption must have an associated avro schema in Atlas. Every step in its journey through Aeolus, in flight or at rest, is captured in Atlas. Atlas’ extensibility has allowed us to add or update various entity types (e.g., avro schemas, kafka topics, object store pseudo-directories) and lineage types (e.g., storing streaming data in object storage; embellishing and re-publishing streaming data; performing aggregations and other transformations on data at rest; and evolution of schemas with compatibility flags). Transformation services notify Atlas of lineage links via custom asynchronous kafka messaging.
Atlas provides self-service data discovery and lineage browsing and querying, via full-text search, DSL query language, or gremlin graph query language. Example queries: “Where is data from kafka topic X stored?” “Display the journey of data currently stored in pseudo-directory X since it entered the Aeolus system”. “Show me all earlier versions of schema S, and whether they are forward/backward compatible with each other.”
Take an in-depth look at data warehousing with Amazon Redshift and get answers to your technical questions. We will cover performance tuning techniques that take advantage of Amazon Redshift's columnar technology and massively parallel processing architecture. We will also discuss best practices for migrating from existing data warehouses, optimizing your schema, loading data efficiently, and using work load management and interleaved sorting.
Moderated by Lars Hofhansl (Salesforce), with Matteo Bertozzi (Cloudera), John Leach (Splice Machine), Maxim Lukiyanov (Microsoft), Matt Mullins (Facebook), and Carter Page (Google)
The future of HBase, via a variety of viewpoints.
HBase hast established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, overing advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the an upcoming second release of the speakers HBase book, correlated with the practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
In this performance-oriented session, we will cover tuning techniques that take advantage of Amazon Redshift's columnar technology and massively parallel processing architecture. We will also discuss best practices for migrating from existing data warehouses, optimizing your schema, loading data efficiently, and using work load management and interleaved sorting.
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all of your data for a fraction of the cost of traditional data warehouses. In this webinar, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to design schemas and load data efficiently
• Learn best practices for workload management, distribution and sort keys, and optimizing queries
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. In this session, we take an in-depth look at data warehousing with Amazon Redshift for big data analytics. We cover best practices to take advantage of Amazon Redshift's columnar technology and parallel processing capabilities to deliver high throughput and query performance. We also discuss how to design optimal schemas, load data efficiently, and use work load management.
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory...Cloudera, Inc.
Mignify is a platform for collecting, storing and analyzing Big Data harvested from the web. It aims at providing an easy access to focused and structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including the temporal aspect. The whole system is deployed in an innovative hardware architecture comprising of a high number of small (low-consumption) nodes. This talk will tackle the decisions made along the design and development of the platform, both under a technical and functional perspective. It will introduce the cloud infrastructure, the LTE-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism of analytics based on a declarative filter/extraction specification. The design choices will be illustrated with a pilot application targeting Daily Web Monitoring in the context of a national domain.
(BDT401) Amazon Redshift Deep Dive: Tuning and Best PracticesAmazon Web Services
Get a look under the covers: Learn tuning best practices for taking advantage of Amazon Redshift's columnar technology and parallel processing capabilities to improve your delivery of queries and improve overall database performance. This session explains how to migrate from existing data warehouses, create an optimized schema, efficiently load data, use work load management, tune your queries, and use Amazon Redshift's interleaved sorting features. Finally, learn how TripAdvisor uses these best practices to give their entire organization access to analytic insights at scale.
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes the computation to the server and why this leads to performance enabling direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in-progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
Deep Dive Amazon Redshift for Big Data Analytics - September Webinar SeriesAmazon Web Services
Analyzing big data quickly and efficiently requires a data warehouse optimized to handle and scale for large datasets. Amazon Redshift is a fast, petabyte-scale data warehouse that makes it simple and cost-effective to analyze big data for a fraction of the cost of traditional data warehouses. By following a few best practices, you can take advantage of Amazon Redshift’s columnar technology and parallel processing capabilities to minimize I/O and deliver high throughput and query performance. This webinar will cover techniques to load data efficiently, design optimal schemas, and tune query and database performance.
Learning Objectives:
• Get an inside look at Amazon Redshift's columnar technology and parallel processing capabilities
• Learn how to migrate from existing data warehouses, optimize schemas, and load data efficiently
• Learn best practices for managing workload, tuning your queries, and using Amazon Redshift's interleaved sorting features
(BDT314) A Big Data & Analytics App on Amazon EMR & Amazon RedshiftAmazon Web Services
"No matter the industry, leading organizations need to closely integrate, deploy, secure, and scale diverse technologies to support workloads while containing costs. Nasdaq, Inc.—a leading provider of trading, clearing, and exchange technology—is no exception.
After migrating more than 1,100 tables from a legacy data warehouse into Amazon Redshift, Nasdaq, Inc. is now implementing a fully-integrated, big data architecture that also includes Amazon S3, Amazon EMR, and Presto to securely analyze large historical data sets in a highly regulated environment. Drawing from this experience, Nasdaq, Inc. shares lessons learned and best practices for deploying a highly secure, unified, big data architecture on AWS.
Attendees learn:
Architectural recommendations to extend an Amazon Redshift data warehouse with Amazon EMR and Presto.
Tips to migrate historical data from an on-premises solution and Amazon Redshift to Amazon S3, making it consumable.
Best practices for securing critical data and applications leveraging encryption, SELinux, and VPC."
This talk delves into the many ways that a user has to use HBase in a project. Lars will look at many practical examples based on real applications in production, for example, on Facebook and eBay and the right approach for those wanting to find their own implementation. He will also discuss advanced concepts, such as counters, coprocessors and schema design.
HBase Advanced Schema Design - Berlin Buzzwords - June 2012larsgeorge
While running a simple key/value based solution on HBase usually requires an equally simple schema, it is less trivial to operate a different application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real world use-cases and how they
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
HBaseCon 2015 General Session: Zen - A Graph Data Model on HBaseHBaseCon
Zen is a storage service built at Pinterest that offers a graph data model of top of HBase and potentially other storage backends. In this talk, Zen's architects go over the design motivation for Zen and describe its internals including the API, type system, and HBase backend.
Further discussion on Data Modeling with Apache Cassandra. Overview of formal data modeling techniques as well as practical. Real-world use cases and associated data models.
I don't think it's hyperbole when I say that Facebook, Instagram, Twitter & Netflix now define the dimensions of our social & entertainment universe. But what kind of technology engines purr under the hoods of these social media machines?
Here is a tech student's perspective on making the paradigm shift to "Big Data" using innovative models: alphabet blocks, nesting dolls, & LEGOs!
Get info on:
- What is Cassandra (C*)?
- Installing C* Community Version on Amazon Web Services EC2
- Data Modelling & Database Design in C* using CQL3
- Industry Use Cases
HBaseCon 2012 | HBase Schema Design - Ian Varley, SalesforceCloudera, Inc.
Most developers are familiar with the topic of “database design”. In the relational world, normalization is the name of the game. How do things change when you’re working with a scalable, distributed, non-SQL database like HBase? This talk will cover the basics of HBase schema design at a high level and give several common patterns and examples of real-world schemas to solve interesting problems. The storage and data access architecture of HBase (row keys, column families, etc.) will be explained, along with the pros and cons of different schema decisions.
Cassandra Data Modeling - Practical Considerations @ Netflixnkorla1share
Cassandra community has consistently requested that we cover C* schema design concepts. This presentation goes in depth on the following topics:
- Schema design
- Best Practices
- Capacity Planning
- Real World Examples
The workshop tells about HBase data model, architecture and schema design principles.
Source code demo:
https://github.com/moisieienko-valerii/hbase-workshop
From: DataWorks Summit 2017 - Munich - 20170406
HBase hast established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, overing advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for the an upcoming second release of the speakers HBase book, correlated with the practical experience in medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
Flipboard services over 100 million users using heterogenous results including user generated content, interest profile, algorithmically generated content, social firehose, friends graph, ads, and web/rss crawlers. To personalize and serve these results in real time, Flipboard employs a variety of data models, access patterns and configuration. This talk will present how some of these strategies are implemented using HBase.
Виталий Бондаренко "Fast Data Platform for Real-Time Analytics. Architecture ...Fwdays
We will start from understanding how Real-Time Analytics can be implemented on Enterprise Level Infrastructure and will go to details and discover how different cases of business intelligence be used in real-time on streaming data. We will cover different Stream Data Processing Architectures and discus their benefits and disadvantages. I'll show with live demos how to build Fast Data Platform in Azure Cloud using open source projects: Apache Kafka, Apache Cassandra, Mesos. Also I'll show examples and code from real projects.
In this introduction to Apache Hive the following topics are covered:
1. Hive Introduction
2. Hive origin
3. Where does Hive fall in Big Data stack
4. Hive architecture
5. Tts job execution mechanisms
6. HiveQL and Hive Shell
7 Types of tables
8. Querying data
9. Partitioning
10. Bucketing
11. Pros
12. Limitations of Hive
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of a Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
FiloDB: Reactive, Real-Time, In-Memory Time Series at ScaleEvan Chan
My keynote presentation about how we developed FiloDB, a distributed, Prometheus-compatible time series database, productionized it at Apple and scaled it out to handle a huge amount of operational data, based on the stack of Kafka, Cassandra, Scala/Akka.
Apache HBase™ is the Hadoop database, a distributed, salable, big data store.Its a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to those large data sets. ... HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
Introduction to HBase | Big Data Hadoop Spark Tutorial | CloudxLabCloudxLab
Big Data with Hadoop & Spark Training: http://bit.ly/2slpJqY
This CloudxLab Introduction to HBase tutorial helps you to understand HBase in detail. Below are the topics covered in this tutorial:
1) HBase - Data Models Examples
2) Bloom Filter
3) HBase - REST APIs
4) HBase - Hands-on Demos on CloudxLab
With the public confession of Facebook, HBase is on everyone's lips when it comes to the discussion around the new "NoSQL" area of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase. This includes the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
Vladimir Lozanov How to deliver high quality apps to the app storeАліна Шепшелей
How to plan and organize QA processes in product development team. Balance between manual and automation testing in iOS products. Importance of collaboration with other teams and how to engage all the team into product quality advocacy.
Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of big-data analytic applications. We will cover approaches of processing Big Data on Spark cluster for real time analytic, machine learning and iterative BI and also discuss the pros and cons of using Spark in Azure cloud.
Valerii Iakovenko Drones as the part of the presentАліна Шепшелей
Drones, these are the tools that have densely entered to our life now. These are sources of geospatial information which form the basis and supplements many systems of monitoring and control. In detail, the speech will be about agribusiness.
Dmitriy Kouperman Working with legacy systems. stabilization, monitoring, man...Аліна Шепшелей
About half of the developers, one way or another, faced with the legacy-projects. Not everyone can (and want) work with them. But with the right approach, such projects can be carried out with pleasure and even enthusiasm. We suggest that such a legacy of understanding, what are these project management techniques, practices, and explore the developers consider useful decisions: • Examples of optimization - it's worth a try; • Monitoring applications - JavaMelody; • Monitoring applications - logs and ELK (ELasticSearch + Logstash + Kibana); • Monitoring applications - Java Mission Control and Heap Dump Memory Analyzer Tool.
Anton Ivinskyi Application level metrics and performance testsАліна Шепшелей
It is important to understand how your code behaves in production, not just guess how it should behave. Know what takes time and what goes wrong. Measure it all. Be ready for the load with performance tests.
Kononenko Alina Designing for Apple Watch and Apple TVАліна Шепшелей
Apple Watch and Apple TV apps are inherently different from other apps, in both form and function. You will learn watchOS and tvOS user experience foundations and design principles, get the quick overview of the best existing solutions and possible ways of extending your projects to this platforms.
Mihail Patalaha Aso: how to start and how to finish?Аліна Шепшелей
How to collect the semantic core;
Which keywords to choose;
How to enter the keywords in the title and description;
How to understand the algorithm of Google Play;
How much traffic can be obtained by optimizing the ASO;
Tips to improve search results.
Gregory Shehet Undefined' on prod, or how to test a react appАліна Шепшелей
During the lecture we'll discuss the unit-testing of the interface. The stack of technologies: React (Redux, MobX), Mocha/Chai, React Tests Utils, Enzyme, Tape/Ava. Also, I will mention how we in Grammarly rewrite selenium to the unit tests, and how it works.
Alexey Osipenko Basics of functional reactive programmingАліна Шепшелей
During the report, we will develop incrementally construct primitives and algebra (in the worst case, just come up with a library interface) for the organization of interaction of the application with the mess in the real world. This approach will keep the application logic in the pure functions and declaratively associate external events with the necessary output.
And no React JS.
Step by step guide on building the multipurpose parser for scalable web data extraction.
Designing and usage of universal format for stripped web articles.
Format comparison with AMP(Google), Facebook Instant Articles and Apple News.
Alex Theedom Java ee revisits design patternsАліна Шепшелей
Enter "Django Channels": new way of desinging and thinking about your application. It separates transport and processing concerns in typical Django project using combination of ASGI (Asynchronous Server Gateway Interface) and worker processes, enabling your application to be "event-oriented" and implement new workflows for processing your data. How does it work? What do you need to start? Is it even useful? Learn for yourself with this introductory talk.
Alexey Tokar To find a needle in a haystackАліна Шепшелей
The talk will cover core principles of text search applicable to fixed size dictionaries. We will have a deep look at some algorithms which are deeply hidden inside huge search engines or basic search inputs on web-sites. My goal is to provide comparison between different search approaches and provide objective assessment based on complexity, memory consumption and CPU utilization of each of them.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Generating a custom Ruby SDK for your web service or Rails API using Smithyg2nightmarescribd
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
4. Apache HBase is
• Open source project built on top of Apache
Hadoop
• NoSQL database
• Distributed, scalable datastore
• Column-family datastore
5. Use cases
Time Series Data
• Sensor, System metrics, Events, Log files
• User Activity
• Hi Volume, Velocity Writes
Information Exchange
• Email, Chat, Inbox
• High Volume, Velocity ReadWrite
Enterprise Application Backend
• Online Catalog
• Search Index
• Pre-Computed View
• High Volume, Velocity Reads
7. Data model overview
Component Description
Table Data organized into tables
RowKey Data stored in rows; Rows identified by RowKeys
Region Rows are grouped in Regions
Column Family Columns grouped into families
Column Qualifier
(Column)
Indentifies the column
Cell Combination of the row key, column family, column, timestamp; contains the
value
Version Values within in cell versioned by version number → timestamp
8. Data model: Rows
RowKey
contacs accounts …
mobile email skype UAH USD …
084ab67e VAL VAL
2333bbac VAL VAL
342bbecc VAL
4345235b VAL
565c4f8f VAL VAL VAL
675555ab VAL VAL VAL VAL VAL
9745c563 VAL VAL
a89d3211 VAL VAL VAL VAL
f091e589 VAL VAL VAL
9. Data model: Rows order
Rows are sorted in lexicographical order
+bill
04523
10942
53205
_tim
andy
josh
steve
will
10. Data model: Regions
RowKey
contacs accounts …
mobile email skype UAH USD …
084ab67e VAL VAL
2333bbac VAL VAL
… VAL
4345235b VAL
… VAL VAL VAL
675555ab VAL VAL VAL VAL VAL
9745c563 VAL VAL
… VAL VAL VAL VAL
f091e589 VAL VAL VAL
RowKeys ranges → Regions
R1
R2
R3
11. Data model: Column Family
RowKey
contacs accounts
mobile email skype UAH USD
084ab67e VAL VAL
2333bbac VAL VAL
342bbecc VAL
4345235b VAL
565c4f8f VAL VAL VAL
675555ab VAL VAL VAL VAL VAL
9745c563 VAL VAL
12. Data model: Column Family
• Column Families are part of the table schema and
defined on the table creation
• Columns are grouped into column families
• Column Families are stored in separate HFiles at
HDFS
• Data is grouped to Column Families by common
attribute
15. Data model: Cells
• Data is stored in KeyValue format
• Value for each cell is specified by complete
coordinates: RowKey, Column Family, Column
Qualifier, Version
29. Data write and fault tolerance
• Data writes are recorded in WAL
• Data is written to memstore
• When memstore is full -> data is written to disk in
HFile
34. Web console
Default address: master_host:60010
Shows:
• Live and dead region servers
• Region request count per second
• Tables and region sizes
• Current compactions
• Current memory state
36. Elements of Schema Design
HBase schema design is QUERY based
1.Column families determination
2.RowKey design
3.Columns usage
4.Cell versions usage
5.Column family attribute: Compression, TimeToLive,
Min/Max Versions, Im-Memory
37. Column Families determination
• Data, that accessed together should be stored
together!
• Big number of column families may avoid
performance. Optimal: ≤ 3
• Using compression may improve read performance
and reduce store data size, but affect write
performance
38. RowKey design
• Do not use sequential keys like timestamp
• Use hash for effective key distribution
• Use composite keys for effective scans
40. Tall-Narrow Vs. Flat-Wide Tables
Tall-Narrow provides better quality granularity
• Finer grained RowKey
• Works well with Get
Flat-Wide supports build-in row atomicity
• More values in a single row
• Works well to update multiple values
• Works well to get multiple associated values
41. Column Families properties
Compression
• LZO
• GZIP
• SNAPPY
Time To Live (TTL)
• Keep data for some time and then delete when TTL is passed
Versioning
• Keep fewer versions means less data in scans. Default now 1
• Combine MIN_VERSIONS with TTL to keep data older than TTL
In-Memory setting
• A setting to suggest that server keeps data in cache. Not guaranteed
• Use for small, high-access column families
43. API: All the things
• New Java API since HBase 1.0
• Table Interface for Data Operations: Put, Get, Scan,
Increment, Delete
• Admin Interface for DDL operations: Create Table,
Alter Table, Enable/Disable
46. Performance: Client reads
• Determine as much key component, as possible
• Determination of ColumnFamily reduce disk IO
• Determination of Column, Version reduce network
traffic
• Determine startRow, endRow for Scans, where
possible
• Use caching with Scans
47. Performance: Client writes
• Use batches to reduce RPC calls and improve
performance
• Use write buffer for not critical data. BufferMutator
introduced in HBase API 1.0
• Durability.ASYNC_WAL may be good balance
between performance and reliability