This workshop covers the HBase data model, architecture, and schema design principles.
Source code demo:
https://github.com/moisieienko-valerii/hbase-workshop
The story of one project's architecture evolution from zero to a Lambda Architecture. Also covers how we scaled the cluster once the architecture was in place.
Includes performance charts after every architecture change.
Bravo Six, Going Realtime. Transitioning Activision Data Pipeline to Streamin... - HostedbyConfluent
The Activision Data team has been running a data pipeline for a variety of Activision games for many years. Historically we used a mix of micro-batch microservices coupled with classic Big Data tools like Hadoop and Hive for ETL. As a result, it could take up to 4-6 hours for data to be available to end customers.
In the last few years, the adoption of data in the organization skyrocketed. We needed to de-legacy our data pipeline and provide near-realtime access to data in order to improve reporting, gather insights faster, and power web and mobile applications. I want to tell a story about heavily leveraging Kafka Streams and Kafka Connect to reduce the end-to-end latency to minutes, while making the pipeline easier and cheaper to run. We were able to successfully validate the new data pipeline by launching two massive games just 4 weeks apart.
Enabling Insight to Support World-Class Supercomputing (Stefan Ceballos, Oak ... - confluent
The Oak Ridge Leadership Computing Facility (OLCF) in the National Center for Computational Sciences (NCCS) division at Oak Ridge National Laboratory (ORNL) houses world-class high-performance computing (HPC) resources and has a history of operating top-ranked supercomputers on the TOP500 list, including the world's current fastest, Summit, an IBM AC922 machine with a peak of 200 petaFLOPS. With the exascale era rapidly approaching, the need for a robust and scalable big data platform for operations data is more important than ever. In the past, when a new HPC resource was added to the facility, pipelines from data sources spanned multiple data sinks, which oftentimes resulted in data silos, slow operational data onboarding, and non-scalable data pipelines for batch processing. Using Apache Kafka as the message bus of the division's new big data platform has allowed for easier decoupling of scalable data pipelines, faster data onboarding, and stream processing, with the goal to continuously improve insight into the HPC resources and their supporting systems. This talk will focus on the NCCS division's transition to Apache Kafka over the past few years to enhance the OLCF's current capabilities and prepare for Frontier, OLCF's future exascale system, including the development and deployment of a full big data platform in a Kubernetes environment from both a technical and a cultural-shift perspective. This talk will also cover the mission of the OLCF, the operational data insights related to high-performance computing that the organization strives for, and several use-cases that exist in production today.
How to use Standard SQL over Kafka: From the basics to advanced use cases | F... - HostedbyConfluent
Several different frameworks have been developed to draw data from Kafka and maintain standard SQL over continually changing data. This provides an easy way to query and transform data - now accessible by orders of magnitude more users.
At the same time, using Standard SQL against changing data is a new pattern for many engineers and analysts. While the language hasn’t changed, we’re still in the early stages of understanding the power of SQL over Kafka - and in some interesting ways, this new pattern introduces some exciting new idioms.
In this session, we’ll start with some basic use cases of how Standard SQL can be effectively used over events in Kafka- including how these SQL engines can help teams that are brand new to streaming data get started. From there, we’ll cover a series of more advanced functions and their implications, including:
- WHERE clauses that contain time change the validity intervals of your data; you can programmatically introduce and retract records based on their payloads!
- LATERAL joins turn streams of query arguments into query results; they will automatically share their query plans and resources!
- GROUP BY aggregations can be applied to ever-growing data collections; reduce data that wouldn't even fit in a database in the first place.
We'll review in-production examples of each of these cases, where unmodified Standard SQL, run and maintained over data streams in Kafka, provides the functionality of bespoke stream processors.
Paolo Castagna is a Senior Sales Engineer at Confluent. His background is in 'big data', and he has seen first hand the shift happening in the industry from batch to stream processing and from big data to fast data. His talk will introduce Kafka Streams and explain why Apache Kafka is a great option and simplification for stream processing.
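For readers new to Kafka Streams, here is a minimal word-count topology in Java, a sketch of the API the talk introduces rather than code from the talk; the broker address and the topic names "text-input" and "word-counts" are placeholders.

```java
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-demo");     // also used as the consumer group id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-input");        // hypothetical input topic
        lines.flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\s+")))
             .groupBy((key, word) -> word)          // repartitions the stream by word
             .count()                               // materializes a KTable<String, Long>
             .toStream()
             .to("word-counts", Produced.with(Serdes.String(), Serdes.Long())); // counts are Longs, not Strings

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```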
Keep your Metadata Repository Current with Event-Driven Updates using CDC and... - confluent
The data science techniques and machine learning models that provide the greatest business value and insights require data that spans enterprise silos. To integrate this data, and ensure you’re joining on the right fields, you need a comprehensive, enterprise-wide metadata repository. More importantly, you need it to be always up to date. Nightly updates are simply not good enough when customers and users expect near-real-time responsiveness.
The challenge with keeping a metadata repository up to date lies not with cloud services or distributed storage frameworks, but rather with the relational database management systems (RDBMSs) that dot the enterprise landscape. At Comcast, we’ve found it relatively easy to feed our Apache Atlas metadata repo incrementally from Hadoop and AWS, using event-driven pushes to a dedicated Apache Kafka topic that Atlas listens to. Such pushes are not practical with RDBMSs, however, since the event-driven technique there is the database trigger. Triggers are so invasive and potentially detrimental to performance that your DB admin likely won’t allow one for detecting metadata changes.
Triggers are out. Pulling the complete current state of metadata from an RDBMS at regular intervals and calculating the deltas is too slow and unworkable. And it turns out that out-of-the-box log-based change data capture (CDC) is also a dead end, because metadata changes are represented in transaction logs as SQL DDL strings, not as atomic insert/update/delete operations as they are for data.
So, how do you keep your metadata repository always up to date with the current state of your RDBMS metadata? Our group solved this challenge by creating an alternate method for CDC on RDBMS metadata based on database system tables. Our query-based CDC serves as a Kafka Connect source for our Apache Atlas sink, providing event-driven, continuous updates to RDBMS metadata in our repository, but does not suffer from the usual limitations/disadvantages of vanilla query-based CDC. If you’re facing a similar challenge, join us at this session to learn more about the obstacles you’ll likely face and how you can overcome them using the method we implemented.
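The session describes Comcast's own connector, which is not reproduced here. Purely as a sketch of the general idea of query-based CDC against system tables, the following Java snippet polls the standard information_schema catalog and publishes what it finds to a Kafka topic; the connection URL, credentials, schema, and topic name are all hypothetical, and a real implementation would diff against a saved snapshot and emit only the deltas.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MetadataCdcPoller {
    public static void main(String[] args) throws Exception {
        Properties kafkaProps = new Properties();
        kafkaProps.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        kafkaProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        kafkaProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Connection db = DriverManager.getConnection(
                 "jdbc:postgresql://db-host:5432/app", "user", "secret");   // hypothetical database
             KafkaProducer<String, String> producer = new KafkaProducer<>(kafkaProps);
             PreparedStatement ps = db.prepareStatement(
                 "SELECT table_name, column_name, data_type " +
                 "FROM information_schema.columns WHERE table_schema = ?")) {
            ps.setString(1, "public");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    String key = rs.getString("table_name") + "." + rs.getString("column_name");
                    // A real implementation diffs this snapshot against the previous
                    // one and emits only the deltas; here every row is published.
                    producer.send(new ProducerRecord<>("rdbms-metadata", key, rs.getString("data_type")));
                }
            }
        }
    }
}
```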
Cloud-Based Event Stream Processing Architectures and Patterns with Apache Ka... - HostedbyConfluent
The Apache Kafka ecosystem is rich with components that make it possible to design and implement secure, efficient, fault-tolerant and scalable event stream processing (ESP) systems. Using real-world examples, this talk covers why Apache Kafka is an excellent choice for cloud-native and hybrid architectures, how to go about designing, implementing and maintaining ESP systems, best practices and patterns for migrating to the cloud or hybrid configurations, when to go with PaaS or IaaS, what options are available for running Kafka in cloud or hybrid environments, and what you need to build and maintain successful ESP systems that are secure, performant, reliable, highly available and scalable.
DataOps Automation for a Kafka Streaming Platform (Andrew Stevenson + Spiros ... - HostedbyConfluent
DataOps challenges us to build data experiences in a repeatable way. For those with Kafka, this means finding a means of deploying flows in an automated and consistent fashion.
The challenge is to make the deployment of Kafka flows consistent across different technologies and systems: the topics, the schemas, the monitoring rules, the credentials, the connectors, the stream processing apps. And ideally not coupled to a particular infrastructure stack.
In this talk we will discuss the different approaches to automating the deployment of Kafka flows, including Git operators and Kubernetes operators, and their benefits and disadvantages. We will walk through and demo deploying a flow on AWS EKS with MSK and Kafka Connect using GitOps practices, including a stream processing application and an S3 connector with credentials held in AWS Secrets Manager.
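Code from the demo is not reproduced here, but underneath any such automation sits programmatic resource management against the cluster. As a minimal sketch (the topic name, partition count, and replication factor are illustrative values only), a topic can be provisioned with Kafka's AdminClient:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class TopicProvisioner {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("orders", 6, (short) 3);   // name, partitions, replication factor
            admin.createTopics(Collections.singletonList(topic))
                 .all()
                 .get();                                             // throws if the topic already exists
        }
    }
}
```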
Building Realtime Data Pipelines with Kafka Connect and Spark Streaming - Guozhang Wang
Spark Streaming makes it easy to build scalable, robust stream processing applications — but only once you’ve made your data accessible to the framework. Spark Streaming solves the realtime data processing problem, but to build a large-scale data pipeline we need to combine it with another tool that addresses data integration challenges. The Apache Kafka project recently introduced a new tool, Kafka Connect, to make data import/export to and from Kafka easier.
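As a sketch of that first step, making Kafka data accessible to Spark, here is a minimal Java example using the current Structured Streaming Kafka source (the talk itself predates this API and uses DStreams); the broker address and topic name are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class KafkaToSpark {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
            .appName("kafka-to-spark")
            .master("local[*]")   // in a real deployment the master comes from spark-submit
            .getOrCreate();

        Dataset<Row> events = spark.readStream()
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")  // assumed broker address
            .option("subscribe", "events")                        // hypothetical topic
            .load();  // columns include key, value, topic, partition, offset, timestamp

        events.selectExpr("CAST(value AS STRING)")  // Kafka values arrive as raw bytes
              .writeStream()
              .format("console")
              .start()
              .awaitTermination();
    }
}
```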
Maximize the Business Value of Machine Learning and Data Science with Kafka (... - confluent
Today, many companies that have lots of data are still struggling to derive value from machine learning (ML) and data science investments. Why? Accessing the data may be difficult. Or maybe it’s poorly labeled. Or vital context is missing. Or there are questions around data integrity. Or standing up an ML service can be cumbersome and complex.
At Nuuly, we offer an innovative clothing rental subscription model and are continually evolving our ML solutions to gain insight into the behaviors of our unique customer base as well as provide personalized services. In this session, I’ll share how we used event streaming with Apache Kafka® and Confluent Cloud to address many of the challenges that may be keeping your organization from maximizing the business value of machine learning and data science. First, you’ll see how we ensure that every customer interaction and its business context is collected. Next, I’ll explain how we can replay entire interaction histories using Kafka as a transport layer as well as a persistence layer and a business application processing layer. Order management, inventory management, logistics, subscription management – all of it integrates with Kafka as the common backbone. These data streams enable Nuuly to rapidly prototype and deploy dynamic ML models to support various domains, including pricing, recommendations, product similarity, and warehouse optimization. Join us and learn how Kafka can help improve machine learning and data science initiatives that may not be delivered to their full potential.
A stream processing platform is not an island unto itself; it must be connected to all of your existing data systems, applications, and sources. In this talk we will provide different options for integrating systems and applications with Apache Kafka, with a focus on the Kafka Connect framework and the ecosystem of Kafka connectors. We will discuss the intended use cases for Kafka Connect and share our experience and best practices for building large-scale data pipelines using Apache Kafka.
Kappa Architecture on Apache Kafka and Querona: datamass.io - Piotr Czarnas
Kappa architecture for event processing using Apache Kafka and Querona for managing data, joining external data sources and empowering data science teams.
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ... - confluent
PayPal currently processes tens of billions of signals per day from different sources in batch and streaming mode. The data processing platform is the one powering these different analytical needs and use cases, not just at PayPal but also at our adjacencies like Venmo, Hyperwallet and iZettle. End users of this platform demand access to data insights with as much flexibility as possible to explore them with low processing latency.
One such use case is our Switchboard (data de-multiplexer) platform, where we process approximately 20 billion events daily and provide data to different teams and platforms within PayPal, and also to platforms outside PayPal, for more insights. When we started building this platform, Kafka was just another asynchronous message processing platform for us, but we have seen it evolve to a place where it adds value not just in terms of event processing but also for platform resiliency and scalability.
Takeaway for the audience: most people work with and have knowledge about data. With this talk I want to present information that is relevant and meaningful to the audience, with examples that will make it easier for attendees to understand our complex system, and hopefully some practical takeaways for applying Kafka to the similar problems at hand.
A Collaborative Data Science Development Workflow - Databricks
Collaborative data science workflows have several moving parts, and many organizations struggle with developing an efficient and scalable process. Our solution consists of data scientists individually building and testing Kedro pipelines and measuring performance using MLflow tracking. Once a strong solution is created, the candidate pipeline is trained on cloud-agnostic, GPU-enabled containers. If this pipeline is production worthy, the resulting model is served to a production application through MLflow.
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S... - confluent
Responding to a global pandemic presents a unique set of technical and public health challenges. The real challenge is that the ability to gather data coming in via many data streams, in a variety of formats, influences the real-world outcome and impacts everyone. The Centers for Disease Control and Prevention's CELR (COVID Electronic Lab Reporting) program was established to rapidly aggregate, validate, transform, and distribute laboratory testing data submitted by public health departments and other partners. Confluent Kafka with KStreams and Connect plays a critical role in the program objectives to:
- Track the threat of the COVID-19 virus
- Provide comprehensive data for local, state, and federal response
- Better understand locations with an increase in incidence
Low-latency data applications with Kafka and Agg indexes | Tino Tereshko, Fir... - HostedbyConfluent
If a real-time dashboard takes 5 minutes to refresh, it’s not real-time. With data lakes increasingly enabling massive amounts of unprocessed data sets, delivering low-latency analytics is not for the faint-hearted. Learn how to stream massive amounts of data, volumes that used to be impossible to handle, from Kafka to serve real-time applications using lake-scale optimized approaches to storage and indexing.
This is the slide deck that was used for the talk 'Change Data Capture using Kafka' at the Kafka Meetup at LinkedIn (Bangalore) held on 11th June 2016.
The talk describes the need for CDC and why it's a good use case for Kafka.
Self-service Events & Decentralised Governance with AsyncAPI: A Real World Ex... - HostedbyConfluent
Despite great advances in Kafka's SaaS offerings it can still be challenging to create a sustainable event-driven ecosystem. Often platform engineers become de facto ‘gatekeepers’ of events & topics, yet their day job is not about data modelling or domain expertise. We've all seen the bottlenecks these unsustainable processes create.
Realising the potential of event streams requires much more than infrastructure. Beyond an event-driven mindset, it requires domain experts to lead creation of well-defined discoverable events through fit-for-purpose governance. AsyncAPI is the OpenAPI for events that can form the basis of the required self-governing, self-service eventing framework.
This session will introduce a self-governing framework using AsyncAPI and share how the Bank of New Zealand applied this framework to leverage a passionate Kafka community and embed event-driven thinking. You’ll leave with a tangible set of ideas to give your own events a bit more swagger using AsyncAPI.
The Key to Machine Learning is Prepping the Right Data with Jean Georges Perrin - Databricks
Machine learning has its challenges, and understanding the algorithms is not always easy. In this session, you’ll discover methods to make these challenges less daunting.
Intended for software engineers who need to understand the requirements and constraints of data scientists, and data scientists who need to implement or help implement production systems, the session will begin with a quick introduction to data quality and a level-set on common vocabulary. You’ll then explore the formats that are required by Spark ML to run its algorithms, and see how to automate the build through user-defined functions and other techniques. Automation will make reproducibility easy, minimize errors and increase the efficiency of data scientists.
Key takeaways will include:
– How to build the required tool set in Java
– Understanding the formats required by Spark ML (a new vocabulary)
– Learning fundamentals about data quality and how to make sure the data is usable
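To make the Spark ML format point concrete, here is a minimal Java sketch, not taken from the session, that assembles raw numeric columns into the single vector column Spark ML algorithms expect; the Measurement input schema is invented for illustration.

```java
import java.io.Serializable;
import java.util.Arrays;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FeaturePrep {
    // Hypothetical input record with two numeric columns: width, height.
    public static class Measurement implements Serializable {
        private final double width, height;
        public Measurement(double width, double height) { this.width = width; this.height = height; }
        public double getWidth() { return width; }
        public double getHeight() { return height; }
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("feature-prep").master("local[*]").getOrCreate();

        Dataset<Row> raw = spark.createDataFrame(
            Arrays.asList(new Measurement(1.0, 2.0), new Measurement(3.0, 4.0)),
            Measurement.class);

        VectorAssembler assembler = new VectorAssembler()
            .setInputCols(new String[]{"width", "height"})
            .setOutputCol("features");  // the single vector column Spark ML estimators consume

        assembler.transform(raw).show(false);
        spark.stop();
    }
}
```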
Tangram: Distributed Scheduling Framework for Apache Spark at Facebook - Databricks
Tangram is a state-of-the-art resource allocator and distributed scheduling framework for Spark at Facebook, with hierarchical queues and a resource-based container abstraction. We support scheduling and resource management for a significant portion of Facebook's data warehouse and machine learning workloads, which equates to running millions of jobs across several clusters with tens of thousands of machines. In this talk, we will describe Tangram's architecture, discuss Facebook's need for a custom scheduler, and explain how Tangram schedules Spark workloads at scale. We will specifically focus on several important features around improving Spark's efficiency, usability and reliability: (1) IO-rebalancer (Tetris) support, (2) user-fairness queueing, and (3) heuristic-based backfill scheduling optimizations.
The main topic of these slides is building a highly available, high-throughput system for receiving and saving different kinds of information, with the possibility of horizontal scaling, using HBase, Flume and Grizzly hosted on low-cost Amazon EC2 instances. The talk describes the HBase HA cluster setup process with useful hints and EC2 pitfalls, and the Flume setup process, comparing the standalone and embedded Flume versions and showing the differences and use cases of both. A lot of attention is paid to Flume-to-HBase streaming features, with tweaks and different approaches for speeding up this process.
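The slides' own code is not reproduced here, but for orientation, this is a minimal Java sketch of the HBase client write path that a Flume-to-HBase sink ultimately drives; the table and column names are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("events"))) {   // hypothetical table
            Put put = new Put(Bytes.toBytes("row-001"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("payload"), Bytes.toBytes("hello"));
            table.put(put);   // single-row write; sinks typically batch many Puts per RPC
        }
    }
}
```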
DuyHai DOAN - Real time analytics with Cassandra and Spark - NoSQL matters Pa... - NoSQLmatters
Apache Spark is a general data processing framework which allows you to perform map-reduce tasks (but not only) in memory. Apache Cassandra is a highly available and massively scalable NoSQL data store. By combining Spark's flexible API and Cassandra's performance, we get an interesting alternative to the Hadoop ecosystem for both real-time and batch processing. During this talk we will highlight the tight integration between Spark and Cassandra and demonstrate some usages with a live code demo.
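As a small taste of that integration, and only a sketch under assumed names rather than the talk's demo code, the DataStax spark-cassandra-connector exposes a Cassandra table to Spark as an RDD; the keyspace, table, and contact point below are hypothetical.

```java
import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class CassandraSparkSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
            .setAppName("cassandra-spark")
            .setMaster("local[*]")
            .set("spark.cassandra.connection.host", "127.0.0.1");  // assumed Cassandra contact point
        JavaSparkContext sc = new JavaSparkContext(conf);

        long rows = javaFunctions(sc)
            .cassandraTable("demo", "events")   // Spark partitions align with Cassandra token ranges
            .count();                           // a Spark action, executed across the cluster

        System.out.println("rows = " + rows);
        sc.stop();
    }
}
```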
AWS Simple Workflow: Distributed Out of the Box! - Morning@Lohika - Serhiy Batyuk
Do you have a lot of complex jobs that you need to run as part of your application? Do they consist of multiple tasks and you wonder how to orchestrate them properly? Do you want to be able to easily scale their execution? Is availability of your workers important to you? If you answer “Yes” to these questions then AWS Simple Workflow is the right tool for you.
In this talk we will go through Amazon SWF and Java Flow Framework and you will see how to get a distributed job execution engine right out of the box. We will also compare SWF to alternative solutions, discuss real life experience, and of course enjoy a live demo.
The talk will be most useful to everyone who is interested in the design of distributed systems and is new to AWS SWF.
Spark is a fast and general engine for large-scale data processing which can solve all of your problems.
… Or can it?
This talk will cover real-world issues encountered during migration of the existing product to Spark infrastructure.
Aimed at software engineers who have just started to evaluate Spark, or those who are already using it.
The workshop is based on several lectures by Nikita Salnikov-Tarnovski plus my own research. The workshop consists of two parts. The first part covers:
- different Java GCs, their main features, advantages and disadvantages;
- principles of GC tuning;
- working with GCViewer as a tool for GC analysis;
- a first-steps tuning demo;
- a comparison of the primary GCs on Java 1.7 and Java 1.8
The second part covers:
- working with off-heap memory: ByteBuffer / direct ByteBuffer / Unsafe / MapDB;
- examples and a comparison of the approaches;
The off-heap-demo: https://github.com/moisieienko-valerii/off-heap-demo
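For a taste of the off-heap material, here is a minimal sketch using a direct ByteBuffer: the buffer's contents live in native memory outside the Java heap, so the data itself adds no GC pressure (only the small buffer object is tracked by the collector).

```java
import java.nio.ByteBuffer;

public class OffHeapSketch {
    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocateDirect(1024);  // 1 KiB of native (off-heap) memory

        buffer.putLong(42L);          // write at the current position
        buffer.putDouble(3.14);

        buffer.flip();                // switch from writing to reading
        System.out.println(buffer.getLong());    // 42
        System.out.println(buffer.getDouble());  // 3.14
    }
}
```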
This presentation will be useful to those who would like to get acquainted with Apache Spark's architecture and top features and see some of them in action, e.g. RDD transformations and actions, Spark SQL, etc. It also covers real-life use cases related to one of our commercial projects and recalls the roadmap of how we've integrated Apache Spark into it.
Was presented on Morning@Lohika tech talks in Lviv.
Design by Yarko Filevych: http://www.filevych.com/
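For readers unfamiliar with the RDD basics the deck covers, here is a minimal Java sketch (not from the presentation) showing the split between lazy transformations and eager actions.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddBasics {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            int sumOfEvenSquares = numbers
                .filter(n -> n % 2 == 0)   // transformation: lazy, nothing runs yet
                .map(n -> n * n)           // transformation: lazy
                .reduce(Integer::sum);     // action: triggers the actual job

            System.out.println(sumOfEvenSquares);  // 4 + 16 = 20
        }
    }
}
```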
Memory Management: What You Need to Know When Moving to Java 8 - AppDynamics
This presentation will compare and contrast application behavior in Java 7 with Java 8, particularly focusing on memory management and usage. Several code examples are presented to show how to recognize and respond to common pitfalls.
Deletes Without Tombstones or TTLs (Eric Stevens, ProtectWise) | Cassandra Su... - DataStax
Deleting data from Cassandra has several challenges, and existing solutions (tombstones or TTLs) have limitations that make them unusable or untenable in certain circumstances. We'll explore the cases where existing deletion options fail or are inadequate, then describe a solution we developed which deletes data from Cassandra during standard or user-defined compaction, but without resorting to tombstones or TTL's.
About the Speaker
Eric Stevens Principal Architect, ProtectWise, Inc.
Eric is the principal architect and a day-one employee of ProtectWise, Inc., specializing in massive real-time processing and scalability problems. The team at ProtectWise processes, analyzes, optimizes, indexes, and stores billions of network packets each second. They look for threats in real time, but also store full-fidelity network data (including PCAP), and when new security intelligence is received, automatically replay existing network history through that new intelligence.
HBase has established itself as the backend for many operational and interactive use-cases, powering well-known services that support millions of users and thousands of concurrent requests. In terms of features HBase has come a long way, offering advanced options such as multi-level caching on- and off-heap, pluggable request handling, fast recovery options such as region replicas, table snapshots for data governance, tuneable write-ahead logging and so on. This talk is based on the research for an upcoming second edition of the speaker's HBase book, combined with practical experience from medium to large HBase projects around the world. You will learn how to plan for HBase, starting with the selection of the matching use-cases, to determining the number of servers needed, leading into performance tuning options. There is no reason to be afraid of using HBase, but knowing its basic premises and technical choices will make using it much more successful. You will also learn about many of the new features of HBase up to version 1.3, and where they are applicable.
From: DataWorks Summit 2017 - Munich - 20170406
HBase Advanced Schema Design - Berlin Buzzwords - June 2012 - larsgeorge
While running a simple key/value-based solution on HBase usually requires an equally simple schema, it is less trivial to operate an application that has to insert thousands of records per second. This talk will address the architectural challenges when designing for either read or write performance imposed by HBase. It will include examples of real-world use-cases and how they
http://berlinbuzzwords.de/sessions/advanced-hbase-schema-design
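As a flavor of write-oriented schema design, here is a sketch of two classic HBase row-key techniques, salting and reversed timestamps; the bucket count and key layout are illustrative assumptions, not prescriptions from the talk.

```java
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyDesign {
    // Salting spreads monotonically increasing writes across regions;
    // a reversed timestamp makes the newest rows sort first within a prefix.
    static byte[] rowKey(String userId, long eventTimeMillis) {
        byte salt = (byte) Math.floorMod(userId.hashCode(), 16);   // 16 salt buckets (illustrative)
        long reversedTs = Long.MAX_VALUE - eventTimeMillis;        // newest-first ordering
        return Bytes.add(new byte[]{salt},
                         Bytes.add(Bytes.toBytes(userId), Bytes.toBytes(reversedTs)));
    }

    public static void main(String[] args) {
        byte[] key = rowKey("user-42", System.currentTimeMillis());
        System.out.println(Bytes.toStringBinary(key));
    }
}
```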
Flipboard serves over 100 million users with heterogeneous results including user-generated content, interest profiles, algorithmically generated content, the social firehose, the friends graph, ads, and web/RSS crawlers. To personalize and serve these results in real time, Flipboard employs a variety of data models, access patterns and configuration. This talk will present how some of these strategies are implemented using HBase.
This talk delves into the many ways to use HBase in a project. Lars will look at many practical examples based on real applications in production, for example at Facebook and eBay, and the right approach for those wanting to find their own implementation. He will also discuss advanced concepts, such as counters, coprocessors and schema design.
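As a small illustration of the counters concept mentioned above (a sketch with hypothetical table and column names, not code from the talk), HBase can increment a cell atomically on the server side, avoiding a client-side read-modify-write race:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CounterSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("metrics"))) {   // hypothetical table
            long newValue = table.incrementColumnValue(
                Bytes.toBytes("page-home"),     // row
                Bytes.toBytes("c"),             // column family
                Bytes.toBytes("views"),         // qualifier
                1L);                            // delta, applied atomically on the region server
            System.out.println("views = " + newValue);
        }
    }
}
```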
In this introduction to Apache Hive the following topics are covered:
1. Hive Introduction
2. Hive origin
3. Where does Hive fall in Big Data stack
4. Hive architecture
5. Its job execution mechanisms
6. HiveQL and Hive Shell
7. Types of tables
8. Querying data
9. Partitioning
10. Bucketing
11. Pros
12. Limitations of Hive
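As a minimal illustration, not part of the deck, HiveQL can be executed from Java over JDBC via HiveServer2; the connection URL, credentials, table, and partition column below are hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             // Partition pruning: the predicate on the partition column (dt)
             // limits the scan to a single partition's directory.
             ResultSet rs = stmt.executeQuery(
                 "SELECT COUNT(*) FROM page_views WHERE dt = '2016-06-11'")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}
```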
Vitaliy Bondarenko "Fast Data Platform for Real-Time Analytics. Architecture ..." - Fwdays
We will start by understanding how Real-Time Analytics can be implemented on enterprise-level infrastructure, then go into the details and discover how different business intelligence cases can be handled in real time on streaming data. We will cover different Stream Data Processing Architectures and discuss their benefits and disadvantages. I'll show with live demos how to build a Fast Data Platform in the Azure Cloud using open-source projects: Apache Kafka, Apache Cassandra, Mesos. I'll also show examples and code from real projects.
Cloudera Impala: The Open Source, Distributed SQL Query Engine for Big Data. The Cloudera Impala project is pioneering the next generation of Hadoop capabilities: the convergence of fast SQL queries with the capacity, scalability, and flexibility of an Apache Hadoop cluster. With Impala, the Hadoop ecosystem now has an open-source codebase that helps users query data stored in Hadoop-based enterprise data hubs in real time, using familiar SQL syntax.
This talk will begin with an overview of the challenges organizations face as they collect and process more data than ever before, followed by an overview of Impala from the user's perspective and a dive into Impala's architecture. It concludes with stories of how Cloudera's customers are using Impala and the benefits they see.
Apache HBase™ is the Hadoop database: a distributed, scalable big data store. It's a column-oriented database management system that runs on top of HDFS.
Apache HBase is an open source NoSQL database that provides real-time read/write access to large data sets. HBase is natively integrated with Hadoop and works seamlessly alongside other data access engines through YARN.
With the public confession of Facebook, HBase is on everyone's lips when it comes to the discussion around the new "NoSQL" area of databases. In this talk, Lars will introduce and present a comprehensive overview of HBase. This includes the history of HBase, the underlying architecture, available interfaces, and integration with Hadoop.
Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...Shahin Sheidaei
Games are powerful teaching tools, fostering hands-on engagement and fun. But they require careful consideration to succeed. Join me to explore factors in running and selecting games, ensuring they serve as effective teaching tools. Learn to maintain focus on learning objectives while playing, and how to measure the ROI of gaming in education. Discover strategies for pitching gaming to leadership. This session offers insights, tips, and examples for coaches, team leads, and enterprise leaders seeking to teach from simple to complex concepts.
Listen to the keynote address and hear about the latest developments from Rachana Ananthakrishnan and Ian Foster who review the updates to the Globus Platform and Service, and the relevance of Globus to the scientific community as an automation platform to accelerate scientific discovery.
Globus Connect Server Deep Dive - GlobusWorld 2024Globus
We explore the Globus Connect Server (GCS) architecture and experiment with advanced configuration options and use cases. This content is targeted at system administrators who are familiar with GCS and currently operate—or are planning to operate—broader deployments at their institution.
Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production.
Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process?
In this session we will cover:
- The Art of Effective Code Reviews
- Streamlining the Review Process
- Elevating Reviews with Automated Tools
By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces
OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoamtakuyayamamoto1800
In this slide, we show the simulation example and the way to compile this solver.
In this solver, the Helmholtz equation can be solved by helmholtzFoam. Also, the Helmholtz equation with uniformly dispersed bubbles can be simulated by helmholtzBubbleFoam.
How to Position Your Globus Data Portal for Success Ten Good PracticesGlobus
Science gateways allow science and engineering communities to access shared data, software, computing services, and instruments. Science gateways have gained a lot of traction in the last twenty years, as evidenced by projects such as the Science Gateways Community Institute (SGCI) and the Center of Excellence on Science Gateways (SGX3) in the US, The Australian Research Data Commons (ARDC) and its platforms in Australia, and the projects around Virtual Research Environments in Europe. A few mature frameworks have evolved with their different strengths and foci and have been taken up by a larger community such as the Globus Data Portal, Hubzero, Tapis, and Galaxy. However, even when gateways are built on successful frameworks, they continue to face the challenges of ongoing maintenance costs and how to meet the ever-expanding needs of the community they serve with enhanced features. It is not uncommon that gateways with compelling use cases are nonetheless unable to get past the prototype phase and become a full production service, or if they do, they don't survive more than a couple of years. While there is no guaranteed pathway to success, it seems likely that for any gateway there is a need for a strong community and/or solid funding streams to create and sustain its success. With over twenty years of examples to draw from, this presentation goes into detail for ten factors common to successful and enduring gateways that effectively serve as best practices for any new or developing gateway.
Field Employee Tracking System| MiTrack App| Best Employee Tracking Solution|...informapgpstrackings
Keep tabs on your field staff effortlessly with Informap Technology Centre LLC. Real-time tracking, task assignment, and smart features for efficient management. Request a live demo today!
For more details, visit us : https://informapuae.com/field-staff-tracking/
Modern design is crucial in today's digital environment, and this is especially true for SharePoint intranets. The design of these digital hubs is critical to user engagement and productivity enhancement. They are the cornerstone of internal collaboration and interaction within enterprises.
Your Digital Assistant.
Making complex approach simple. Straightforward process saves time. No more waiting to connect with people that matter to you. Safety first is not a cliché - Securely protect information in cloud storage to prevent any third party from accessing data.
Would you rather make your visitors feel burdened by making them wait? Or choose VizMan for a stress-free experience? VizMan is an automated visitor management system that works for any industries not limited to factories, societies, government institutes, and warehouses. A new age contactless way of logging information of visitors, employees, packages, and vehicles. VizMan is a digital logbook so it deters unnecessary use of paper or space since there is no requirement of bundles of registers that is left to collect dust in a corner of a room. Visitor’s essential details, helps in scheduling meetings for visitors and employees, and assists in supervising the attendance of the employees. With VizMan, visitors don’t need to wait for hours in long queues. VizMan handles visitors with the value they deserve because we know time is important to you.
Feasible Features
One Subscription, Four Modules – Admin, Employee, Receptionist, and Gatekeeper ensures confidentiality and prevents data from being manipulated
User Friendly – can be easily used on Android, iOS, and Web Interface
Multiple Accessibility – Log in through any device from any place at any time
One app for all industries – a Visitor Management System that works for any organisation.
Stress-free Sign-up
Visitor is registered and checked-in by the Receptionist
Host gets a notification, where they opt to Approve the meeting
Host notifies the Receptionist of the end of the meeting
Visitor is checked-out by the Receptionist
Host enters notes and remarks of the meeting
Customizable Components
Scheduling Meetings – Host can invite visitors for meetings and also approve, reject and reschedule meetings
Single/Bulk invites – Invitations can be sent individually to a visitor or collectively to many visitors
VIP Visitors – Additional security of data for VIP visitors to avoid misuse of information
Courier Management – Keeps a check on deliveries like commodities being delivered in and out of establishments
Alerts & Notifications – Get notified on SMS, email, and application
Parking Management – Manage availability of parking space
Individual log-in – Every user has their own log-in id
Visitor/Meeting Analytics – Evaluate notes and remarks of the meeting stored in the system
Visitor Management System is a secure and user friendly database manager that records, filters, tracks the visitors to your organization.
"Secure Your Premises with VizMan (VMS) – Get It Now"
TROUBLESHOOTING 9 TYPES OF OUTOFMEMORYERRORTier1 app
Even though at surface level ‘java.lang.OutOfMemoryError’ appears as one single error; underlyingly there are 9 types of OutOfMemoryError. Each type of OutOfMemoryError has different causes, diagnosis approaches and solutions. This session equips you with the knowledge, tools, and techniques needed to troubleshoot and conquer OutOfMemoryError in all its forms, ensuring smoother, more efficient Java applications.
Multiple Your Crypto Portfolio with the Innovative Features of Advanced Crypt...Hivelance Technology
Cryptocurrency trading bots are computer programs designed to automate buying, selling, and managing cryptocurrency transactions. These bots utilize advanced algorithms and machine learning techniques to analyze market data, identify trading opportunities, and execute trades on behalf of their users. By automating the decision-making process, crypto trading bots can react to market changes faster than human traders
Hivelance, a leading provider of cryptocurrency trading bot development services, stands out as the premier choice for crypto traders and developers. Hivelance boasts a team of seasoned cryptocurrency experts and software engineers who deeply understand the crypto market and the latest trends in automated trading, Hivelance leverages the latest technologies and tools in the industry, including advanced AI and machine learning algorithms, to create highly efficient and adaptable crypto trading bots
Innovating Inference - Remote Triggering of Large Language Models on HPC Clus...Globus
Large Language Models (LLMs) are currently the center of attention in the tech world, particularly for their potential to advance research. In this presentation, we'll explore a straightforward and effective method for quickly initiating inference runs on supercomputers using the vLLM tool with Globus Compute, specifically on the Polaris system at ALCF. We'll begin by briefly discussing the popularity and applications of LLMs in various fields. Following this, we will introduce the vLLM tool, and explain how it integrates with Globus Compute to efficiently manage LLM operations on Polaris. Attendees will learn the practical aspects of setting up and remotely triggering LLMs from local machines, focusing on ease of use and efficiency. This talk is ideal for researchers and practitioners looking to leverage the power of LLMs in their work, offering a clear guide to harnessing supercomputing resources for quick and effective LLM inference.
A Comprehensive Look at Generative AI in Retail App Testing.pdfkalichargn70th171
Traditional software testing methods are being challenged in retail, where customer expectations and technological advancements continually shape the landscape. Enter generative AI—a transformative subset of artificial intelligence technologies poised to revolutionize software testing.
Enhancing Research Orchestration Capabilities at ORNL.pdfGlobus
Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.
SOCRadar Research Team: Latest Activities of IntelBrokerSOCRadar
The European Union Agency for Law Enforcement Cooperation (Europol) has suffered an alleged data breach after a notorious threat actor claimed to have exfiltrated data from its systems. Infamous data leaker IntelBroker posted on the even more infamous BreachForums hacking forum, saying that Europol suffered a data breach this month.
The alleged breach affected Europol agencies CCSE, EC3, Europol Platform for Experts, Law Enforcement Forum, and SIRIUS. Infiltration of these entities can disrupt ongoing investigations and compromise sensitive intelligence shared among international law enforcement agencies.
However, this is neither the first nor the last activity of IntelBroker. We have compiled for you what happened in the last few days. To track such hacker activities on dark web sources like hacker forums, private Telegram channels, and other hidden platforms where cyber threats often originate, you can check SOCRadar’s Dark Web News.
Stay Informed on Threat Actors’ Activity on the Dark Web with SOCRadar!
Prosigns: Transforming Business with Tailored Technology SolutionsProsigns
Unlocking Business Potential: Tailored Technology Solutions by Prosigns
Discover how Prosigns, a leading technology solutions provider, partners with businesses to drive innovation and success. Our presentation showcases our comprehensive range of services, including custom software development, web and mobile app development, AI & ML solutions, blockchain integration, DevOps services, and Microsoft Dynamics 365 support.
Custom Software Development: Prosigns specializes in creating bespoke software solutions that cater to your unique business needs. Our team of experts works closely with you to understand your requirements and deliver tailor-made software that enhances efficiency and drives growth.
Web and Mobile App Development: From responsive websites to intuitive mobile applications, Prosigns develops cutting-edge solutions that engage users and deliver seamless experiences across devices.
AI & ML Solutions: Harnessing the power of Artificial Intelligence and Machine Learning, Prosigns provides smart solutions that automate processes, provide valuable insights, and drive informed decision-making.
Blockchain Integration: Prosigns offers comprehensive blockchain solutions, including development, integration, and consulting services, enabling businesses to leverage blockchain technology for enhanced security, transparency, and efficiency.
DevOps Services: Prosigns' DevOps services streamline development and operations processes, ensuring faster and more reliable software delivery through automation and continuous integration.
Microsoft Dynamics 365 Support: Prosigns provides comprehensive support and maintenance services for Microsoft Dynamics 365, ensuring your system is always up-to-date, secure, and running smoothly.
Learn how our collaborative approach and dedication to excellence help businesses achieve their goals and stay ahead in today's digital landscape. From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business.
Join us on a journey of innovation and growth. Let's partner for success with Prosigns.
Designing for Privacy in Amazon Web ServicesKrzysztofKkol1
Data privacy is one of the most critical issues that businesses face. This presentation shares insights on the principles and best practices for ensuring the resilience and security of your workload.
Drawing on a real-life project from the HR industry, the presentation demonstrates the various challenges involved: data protection, self-healing, business continuity, security, and transparency of data processing. This systematized approach made it possible to create a secure AWS cloud infrastructure that not only met strict compliance rules but also exceeded the client's expectations.
Experience our free, in-depth three-part Tendenci Platform Corporate Membership Management workshop series! In Session 1 on May 14th, 2024, we began with an Introduction and Setup, mastering the configuration of your Corporate Membership Module settings to establish membership types, applications, and more. Then, on May 16th, 2024, in Session 2, we focused on binding individual members to a Corporate Membership and Corporate Reps, teaching you how to add individual members and assign Corporate Representatives to manage dues, renewals, and associated members. Finally, on May 28th, 2024, in Session 3, we covered questions and concerns, addressing any queries or issues you may have.
For more Tendenci AMS events, check out www.tendenci.com/events
4. Apache HBase is
• Open-source project built on top of Apache Hadoop
• NoSQL database
• Distributed, scalable datastore
• Column-family datastore
5. Use cases
Time Series Data
• Sensor, System metrics, Events, Log files
• User Activity
• High Volume, Velocity Writes
Information Exchange
• Email, Chat, Inbox
• High Volume, Velocity Read/Write
Enterprise Application Backend
• Online Catalog
• Search Index
• Pre-Computed View
• High Volume, Velocity Reads
7. Data model overview
Table: data is organized into tables
RowKey: data is stored in rows; rows are identified by RowKeys
Region: rows are grouped into Regions
Column Family: columns are grouped into column families
Column Qualifier (Column): identifies the column
Cell: the combination of RowKey, column family, column qualifier, and timestamp; contains the value
Version: values within a cell are versioned; the version number is a timestamp
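These coordinates map one-to-one onto the client API. A minimal sketch of writing a single cell, assuming an HBase 2.x Java client and a hypothetical table named "user" with the "contacts"/"accounts" families used in the example on the next slides:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PutCellExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {     // Table (hypothetical name)
            Put put = new Put(Bytes.toBytes("084ab67e"));                  // RowKey
            put.addColumn(Bytes.toBytes("contacts"),                       // Column Family
                          Bytes.toBytes("email"),                          // Column Qualifier
                          System.currentTimeMillis(),                      // Version (timestamp)
                          Bytes.toBytes("user@example.com"));              // Cell value
            table.put(put);
        }
    }
}
```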
8. Data model: Rows
[Slide figure: an example table with RowKeys such as 084ab67e, 2333bbac, 342bbecc, …; the column family "contacts" contains the columns mobile, email, and skype, the column family "accounts" contains the columns UAH and USD, and only some cells in each row hold a value (sparse storage).]
9. Data model: Rows order
Rows are sorted in lexicographical order
+bill
04523
10942
53205
_tim
andy
josh
steve
will
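A quick way to see this byte-wise ordering is to sort the keys above exactly the way HBase compares rows, using the unsigned byte comparator from the client library (a minimal, self-contained sketch; zero-padding or fixed-width encoding keeps numeric keys in the expected order):

```java
import java.util.Arrays;
import org.apache.hadoop.hbase.util.Bytes;

public class RowOrderExample {
    public static void main(String[] args) {
        byte[][] keys = {
            Bytes.toBytes("andy"), Bytes.toBytes("04523"), Bytes.toBytes("_tim"),
            Bytes.toBytes("+bill"), Bytes.toBytes("53205"), Bytes.toBytes("10942")
        };
        // Sort the way HBase orders rows: unsigned, byte-by-byte comparison.
        Arrays.sort(keys, Bytes::compareTo);
        for (byte[] k : keys) {
            System.out.println(Bytes.toString(k)); // prints +bill, 04523, 10942, 53205, _tim, andy
        }
    }
}
```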
10. Data model: Regions
RowKey ranges → Regions
[Slide figure: the same example table split into regions R1, R2, and R3, each covering a contiguous, sorted range of RowKeys.]
11. Data model: Column Family
[Slide figure: the same example table with its columns grouped into two column families: "contacts" (mobile, email, skype) and "accounts" (UAH, USD).]
12. Data model: Column Family
• Column Families are part of the table schema and are defined at table creation
• Columns are grouped into column families
• Each Column Family is stored in its own HFiles on HDFS
• Data is grouped into Column Families by a common attribute
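A minimal sketch of defining the families at table creation, using the HBase 2.x builder API; the table and family names are just the running example, not code from the workshop repository:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;

public class CreateTableExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            TableDescriptor user = TableDescriptorBuilder.newBuilder(TableName.valueOf("user"))
                    // each family below ends up in its own set of HFiles on HDFS
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("contacts"))
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("accounts"))
                    .build();
            admin.createTable(user);
        }
    }
}
```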
15. Data model: Cells
• Data is stored in KeyValue format
• The value of each cell is identified by its complete coordinates: RowKey, Column Family, Column Qualifier, Version
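A small sketch of reading one cell back by its complete coordinates (same hypothetical "user" table, HBase 2.x client):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class GetCellExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Get get = new Get(Bytes.toBytes("084ab67e"));                      // RowKey
            get.addColumn(Bytes.toBytes("contacts"), Bytes.toBytes("email"));  // Family + Qualifier
            get.readVersions(3);                                               // up to 3 versions of the cell
            Result result = table.get(get);
            byte[] latest = result.getValue(Bytes.toBytes("contacts"), Bytes.toBytes("email"));
            System.out.println(Bytes.toString(latest));
        }
    }
}
```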
29. Data write and fault tolerance
• Data writes are first recorded in the WAL (write-ahead log)
• Data is then written to the memstore
• When the memstore is full, data is flushed to disk as an HFile
34. Web console
Default address: master_host:60010
Shows:
• Live and dead region servers
• Region request count per second
• Tables and region sizes
• Current compactions
• Current memory state
36. Elements of Schema Design
HBase schema design is QUERY based
1. Column families determination
2. RowKey design
3. Columns usage
4. Cell versions usage
5. Column family attributes: Compression, TimeToLive, Min/Max Versions, In-Memory
37. Column Families determination
• Data that is accessed together should be stored together!
• A large number of column families can hurt performance. Optimal: ≤ 3
• Compression can improve read performance and reduce stored data size, but it affects write performance
38. RowKey design
• Do not use sequential keys such as plain timestamps
• Use a hash (prefix) for even key distribution across regions
• Use composite keys for efficient scans
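One possible way to combine these ideas is a hypothetical key builder that salts the key with a short hash prefix (for distribution) and appends a reversed timestamp (so a scan within one user returns newest entries first). None of the names below come from the workshop repository:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class RowKeys {
    // Hypothetical composite key: <2-hex-char hash salt>-<userId>-<reversed timestamp>
    static String buildKey(String userId, long eventTimeMillis) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(userId.getBytes(StandardCharsets.UTF_8));
        String saltPrefix = String.format("%02x", digest[0]);     // 256 buckets spread writes across regions
        long reversedTs = Long.MAX_VALUE - eventTimeMillis;       // newest events sort first
        return saltPrefix + "-" + userId + "-" + String.format("%019d", reversedTs);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(buildKey("user42", System.currentTimeMillis()));
        // e.g. "b3-user42-9223370…": hash prefix for distribution, composite parts for scans
    }
}
```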
40. Tall-Narrow vs. Flat-Wide Tables
Tall-Narrow provides finer granularity
• Finer-grained RowKey
• Works well with Get
Flat-Wide supports built-in row atomicity
• More values in a single row
• Works well for updating multiple values
• Works well for getting multiple associated values
41. Column Family properties
Compression
• LZO
• GZIP
• SNAPPY
Time To Live (TTL)
• Keep data for some time, then delete it once the TTL has passed
Versioning
• Keeping fewer versions means less data in scans. The default is now 1
• Combine MIN_VERSIONS with TTL to keep data older than the TTL
In-Memory setting
• A hint that the server should keep the column family's data in cache; not guaranteed
• Use for small, high-access column families
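These attributes are set on the column family descriptor. A hedged sketch using the HBase 2.x builder API; the "metrics" family name and the concrete values are illustrative only:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.io.compress.Compression;
import org.apache.hadoop.hbase.util.Bytes;

public class FamilyTuningExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            ColumnFamilyDescriptor metrics = ColumnFamilyDescriptorBuilder
                    .newBuilder(Bytes.toBytes("metrics"))
                    .setCompressionType(Compression.Algorithm.SNAPPY) // compress blocks on disk
                    .setTimeToLive(7 * 24 * 60 * 60)                  // TTL in seconds (one week)
                    .setMaxVersions(1)                                // keep a single version per cell
                    .setMinVersions(1)                                // retain one version even past the TTL
                    .setInMemory(true)                                // hint: keep blocks in the block cache
                    .build();
            admin.modifyColumnFamily(TableName.valueOf("user"), metrics);
        }
    }
}
```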
43. API: All the things
• New Java API since HBase 1.0
• Table interface for data operations: Put, Get, Scan, Increment, Delete
• Admin interface for DDL operations: Create Table, Alter Table, Enable/Disable
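A compact sketch of both interfaces (HBase 2.x client; the table, family, and qualifier names are made up for illustration, and Put/Get/Scan are shown in the earlier sketches):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ApiOverviewExample {
    public static void main(String[] args) throws Exception {
        TableName name = TableName.valueOf("user");
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            // Table interface: data operations
            try (Table table = conn.getTable(name)) {
                long logins = table.incrementColumnValue(Bytes.toBytes("084ab67e"),
                        Bytes.toBytes("accounts"), Bytes.toBytes("logins"), 1L);  // Increment
                table.delete(new Delete(Bytes.toBytes("2333bbac")));              // Delete
                System.out.println("logins = " + logins);
            }
            // Admin interface: DDL operations (disable before altering, then enable again)
            try (Admin admin = conn.getAdmin()) {
                admin.disableTable(name);
                admin.enableTable(name);
            }
        }
    }
}
```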
46. Performance: Client reads
• Specify as many key components as possible
• Specifying the Column Family reduces disk I/O
• Specifying the Column and Version reduces network traffic
• Set startRow and stopRow for Scans where possible
• Use caching with Scans
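A sketch of a read that applies these tips; withStartRow/withStopRow/readVersions are the HBase 2.x method names (older 1.x clients use setStartRow/setStopRow/setMaxVersions), and the row-key range is illustrative:

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class ScanReadExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("user"))) {
            Scan scan = new Scan()
                    .withStartRow(Bytes.toBytes("0"))                              // restrict the key range
                    .withStopRow(Bytes.toBytes("3"))
                    .addColumn(Bytes.toBytes("contacts"), Bytes.toBytes("email"))  // less disk I/O and traffic
                    .readVersions(1)                                               // only the latest version
                    .setCaching(500);                                              // rows fetched per RPC
            try (ResultScanner scanner = table.getScanner(scan)) {
                for (Result r : scanner) {
                    System.out.println(Bytes.toString(r.getRow()));
                }
            }
        }
    }
}
```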
47. Performance: Client writes
• Use batches to reduce RPC calls and improve throughput
• Use a write buffer for non-critical data; BufferedMutator was introduced in the HBase 1.0 API
• Durability.ASYNC_WAL can be a good balance between performance and reliability
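A sketch combining these write-side tips: a batched put with relaxed WAL durability, plus a BufferedMutator for non-critical data. The "events" table and the "d"/"v" family and qualifier are illustrative, not from the workshop code:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class WriteTuningExample {
    public static void main(String[] args) throws Exception {
        TableName name = TableName.valueOf("events");
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create())) {
            // Batch several Puts into one RPC round trip; relax WAL durability per mutation.
            try (Table table = conn.getTable(name)) {
                List<Put> batch = new ArrayList<>();
                for (int i = 0; i < 100; i++) {
                    Put put = new Put(Bytes.toBytes(String.format("row-%05d", i)));
                    put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes(i));
                    put.setDurability(Durability.ASYNC_WAL);   // WAL entry written asynchronously
                    batch.add(put);
                }
                table.put(batch);
            }
            // BufferedMutator queues mutations client-side and flushes them in the background.
            try (BufferedMutator mutator = conn.getBufferedMutator(name)) {
                mutator.mutate(new Put(Bytes.toBytes("row-extra"))
                        .addColumn(Bytes.toBytes("d"), Bytes.toBytes("v"), Bytes.toBytes("late")));
            } // close() flushes anything still buffered
        }
    }
}
```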