Apache Phoenix: Past, Present and Future of SQL over HBase (enissoz)
HBase, as the NoSQL database of choice in the Hadoop ecosystem, has already proven itself at scale and in many mission-critical workloads in hundreds of companies. Phoenix, as the SQL layer on top of HBase, has increasingly become the tool of choice and the perfect complement to HBase. Phoenix is now used more and more for super-low-latency querying and fast analytics across a large number of users in production deployments. In this talk, we will cover what makes Phoenix attractive to current and prospective HBase users, like SQL support, JDBC, data modeling, secondary indexing, UDFs, and also go over recent improvements like the Query Server, ODBC drivers, ACID transactions, Spark integration, etc. We will conclude by looking into items in the pipeline and how Phoenix and HBase interact with other engines like Hive and Spark.
Apache Phoenix and Apache HBase: An Enterprise-Grade Data Warehouse (Josh Elser)
An overview of Apache Phoenix and Apache HBase from the angle of a traditional data warehousing solution. This talk focuses on where this open-source architecture fits into the market, outlines the features and integrations of the product, and shows that it is a viable alternative to traditional data warehousing solutions.
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ... (HBaseCon)
Phoenix has evolved to become a full-fledged relational database layer over HBase data. We'll discuss the fundamental principles of how Phoenix pushes computation to the server and why this leads to performance that enables direct support of low-latency applications, along with some major new features. Next, we'll outline our approach for transaction support in Phoenix, a work in progress, and discuss the pros and cons of the various approaches. Lastly, we'll examine the current means of integrating Phoenix with the rest of the Hadoop ecosystem.
Hortonworks Technical Workshop: HBase and Apache Phoenix (Hortonworks)
HBase is the leading NoSQL database. Tightly integrated with the Hadoop ecosystem, it offers random, real-time read/write capabilities on billions of rows and millions of columns. Apache Phoenix offers a SQL interface to HBase, opening HBase to the large community of SQL developers and enabling interoperability with SQL-compliant applications. The session will cover the essentials of HBase and provide an in-depth insight into Apache Phoenix. Audience: Developers, Architects and System Engineers from the Hortonworks Technology Partner community. Recording:
https://hortonworks.webex.com/hortonworks/lsr.php?RCID=de6d0c435c0761adedf3114a100e7483
HBaseCon 2012 | Mignify: A Big Data Refinery Built on HBase - Internet Memory... (Cloudera, Inc.)
Mignify is a platform for collecting, storing and analyzing Big Data harvested from the web. It aims to provide easy access to focused and structured information extracted from Web data flows. It consists of a distributed crawler, a resource-oriented storage based on HDFS and HBase, and an extraction framework that produces filtered, enriched, and aggregated data from large document collections, including the temporal aspect. The whole system is deployed on an innovative hardware architecture comprising a high number of small (low-consumption) nodes. This talk will cover the decisions made during the design and development of the platform, from both a technical and a functional perspective. It will introduce the cloud infrastructure, the LTE-like ingestion of the crawler output into HBase/HDFS, and the triggering mechanism of analytics based on a declarative filter/extraction specification. The design choices will be illustrated with a pilot application targeting Daily Web Monitoring in the context of a national domain.
HBase Data Modeling and Access Patterns with Kite SDK (HBaseCon)
Speaker: Adam Warrington (Cloudera)
The Kite SDK is a set of libraries and tools focused on making it easier to build systems on top of the Hadoop ecosystem. HBase support has recently been added to the Kite SDK Data Module, which allows a developer to model and access data in HBase consistent with how they would model data in HDFS using Kite. This talk will focus on Kite's HBase support by covering Kite basics and moving through the specifics of working with HBase as a data source. This feature overview will be supplemented by specifics of how that feature is being used in production applications at Cloudera.
HBase can be an intimidating beast for someone considering its adoption. For what kinds of workloads is it well suited? How does it integrate into the rest of my application infrastructure? What are the data semantics upon which applications can be built? What are the deployment and operational concerns? In this talk, I'll address each of these questions in turn. As supporting evidence, both high-level application architecture and internal details will be discussed. This is an interactive talk: bring your questions and your use-cases!
HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2017. It is the biggest and most exciting milestone release from the Apache community after 1.0. HBase 2.0 contains a large number of features that have been a long time in development, some of which include rewritten region assignment, performance improvements (RPC, rewritten write pipeline, etc.), async clients, a C++ client, off-heaping of the memstore and other buffers, Spark integration, shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into technical details on some of the most important improvements in the release, as well as the implications for users in terms of API and upgrade paths. Existing users of HBase/Phoenix as well as operators managing HBase clusters will benefit the most, as they can learn about the new release and the long list of features. We will also briefly cover earlier 1.x release lines and compatibility and upgrade paths for existing users, and conclude by giving an outlook on the next level of initiatives for the project.
This talk will give an overview of two exciting releases for Apache HBase and Phoenix. HBase 2.0 is the next stable major release for Apache HBase, scheduled for early 2017. It is the next evolution from the Apache HBase community after 1.0. HBase 2.0 contains a large number of features that have been a long time in development, some of which include rewritten region assignment, performance improvements (RPC, rewritten write pipeline, etc.), async clients, a C++ client, off-heaping of the memstore and other buffers, Spark integration, shading of dependencies, as well as a lot of other fixes and stability improvements. We will go into technical details on some of the most important improvements in the release, as well as the implications for users in terms of API and upgrade paths. Phoenix 5.0 is the next biggest and most exciting milestone release because of Phoenix's integration with Apache Calcite, which adds a lot of performance benefits with the new query optimizer and helps Phoenix integrate with other data sources, especially those also based on Calcite. It has a lot of cool features such as encoded columns, Kafka and Hive integration, improvements in secondary index rebuilding, and many performance improvements.
Apache HBase Internals You Hoped You Never Needed to Understand (Josh Elser)
Covers numerous internal features, concepts, and implementations of Apache HBase. The focus will be driven from an operational standpoint, investigating each component enough to understand its role in Apache HBase and the generic problems that each are trying to solve. Topics will range from HBase’s RPC system to the new Procedure v2 framework, to filesystem and ZooKeeper use, to backup and replication features, to region assignment and row locks. Each topic will be covered at a high-level, attempting to distill the often complicated details down to the most salient information.
In this session, you will learn the work Xiaomi has done to improve the availability and stability of our HBase clusters, including cross-site data and service backup and a coordinated compaction framework. You'll also learn about the Themis framework, which supports cross-row transactions on HBase based on Google's percolator algorithm, and its usage in Xiaomi's applications.
HBase Read High Availability Using Timeline-Consistent Region Replicas (HBaseCon)
Speakers: Enis Soztutar and Devaraj Das (Hortonworks)
HBase has ACID semantics within a row that make it a perfect candidate for a lot of real-time serving workloads. However, single homing a region to a server implies some periods of unavailability for the regions after a server crash. Although the mean time to recovery has improved a lot recently, for some use cases, it is still preferable to do possibly stale reads while the region is recovering. In this talk, you will get an overview of our design and implementation of region replicas in HBase, which provide timeline-consistent reads even when the primary region is unavailable or busy.
HBase and Phoenix usage at eHarmony. Presents the lambda architecture and the implementation of HBase and Phoenix at eHarmony, as given at Apache PhoenixCon 2016.
In this talk, we discuss how three different types of data stores help eHarmony overcome challenges in scaling various parts of its matching and recommendations stack. We establish the challenge, identify requirements and discuss how specific types of data stores (Document/Key Value/Graph) help us overcome those challenges.
Transactions for Apache HBase™: Apache Tephra provides globally consistent transactions on top of Apache HBase. While HBase provides strong consistency with row- or region-level ACID operations, it sacrifices cross-region and cross-table consistency in favor of scalability. This trade-off requires application developers to handle the complexity of ensuring consistency when their modifications span region boundaries. By providing support for global transactions that span regions, tables, or multiple RPCs, Tephra simplifies application development on top of HBase, without a significant impact on performance or scalability for many workloads.
Speakers: Eli Levine, James Taylor (Salesforce.com) & Maryann Xue (Intel)
HBase is the Turing machine of the Big Data world. It's been scientifically proven that you can do *anything* with it. This is, of course, a blessing and a curse, as there are so many different ways to implement a solution. Apache Phoenix (incubating), the SQL engine over HBase, comes to the rescue. Come learn about the fundamentals of Phoenix and how it hides the complexities of HBase while giving you optimal performance, and hear about new features from our recent release, including updatable views that share the same physical HBase table and n-way equi-joins through a broadcast hash join mechanism. We'll conclude with a discussion about our roadmap and plans to implement cost-based query optimization to dynamically adapt query execution based on your data sizes.
Avoiding Chaos: Methodology for Managing Performance in a Shared Storage A... (brettallison)
Scope - The primary focus of this presentation is the methodology we use for managing performance in a very large shared Storage Area Network environment, with a primary focus on Distributed Systems and IBM Enterprise Storage Server. The focus of this presentation is methodology, NOT measurement. There are numerous excellent presentations already out there on measurement; however, there are several references to measurement tools at the back of the presentation.
Amazon Aurora is a MySQL-compatible relational database engine that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open source databases. Amazon Aurora is disruptive technology in the database space, bringing a new architectural model and distributed systems techniques to provide far higher performance, availability and durability than previously available using conventional monolithic database techniques. In this session, we will do a deep-dive into some of the key innovations behind Amazon Aurora, discuss best practices and configurations, and share early customer experience from the field.
Best Practices for Migrating your Data Warehouse to Amazon Redshift (Amazon Web Services)
You can gain substantially more business insights and save costs by migrating your existing data warehouse to Amazon Redshift. This session will cover the key benefits of migrating to Amazon Redshift, migration strategies, and tools and resources that can help you in the process. We’ll learn about AWS Database Migration Service and AWS Schema Migration Tool, which were recently enhanced to import data from six common data warehouse platforms.
Learn how Amazon Redshift, our fully managed, petabyte-scale data warehouse, can help you quickly and cost-effectively analyze all of your data using your existing business intelligence tools. Get an introduction to how Amazon Redshift uses massively parallel processing, scale-out architecture, and columnar direct-attached storage to minimize I/O time and maximize performance. Learn how you can gain deeper business insights and save money and time by migrating to Amazon Redshift. Take away strategies for migrating from on-premises data warehousing solutions, tuning schema and queries, and utilizing third party solutions.
Dive deep into some of the key innovations behind Amazon Aurora, discuss best practices and configurations, and share early customer experience from the field.
Performance Optimizations in Apache Impala (Cloudera, Inc.)
Apache Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Hive or Spark. Impala is written from the ground up in C++ and Java. It maintains Hadoop's flexibility by utilizing standard components (HDFS, HBase, Metastore, Sentry) and is able to read the majority of the widely used file formats (e.g. Parquet, Avro, RCFile).
To reduce latency, such as that incurred from utilizing MapReduce or by reading data remotely, Impala implements a distributed architecture based on daemon processes that are responsible for all aspects of query execution and that run on the same machines as the rest of the Hadoop infrastructure. Impala employs runtime code generation using LLVM in order to improve execution times and uses static and dynamic partition pruning to significantly reduce the amount of data accessed. The result is performance that is on par or exceeds that of commercial MPP analytic DBMSs, depending on the particular workload. Although initially designed for running on-premises against HDFS-stored data, Impala can also run on public clouds and access data stored in various storage engines such as object stores (e.g. AWS S3), Apache Kudu and HBase. In this talk, we present Impala's architecture in detail and discuss the integration with different storage engines and the cloud.
UNC309 - Getting the Most out of Microsoft Exchange Server 2010: Performance ... (Louis Göhl)
Selecting the right server hardware for an Exchange 2010 deployment becomes much easier when you know the product team's scalability and performance guidelines. This session provides a look at the product team's guidance for the processor and memory requirements of each server role in Exchange 2010. A number of key performance enhancements from this release are discussed, and you also learn about how to use related tools like the Exchange Storage Calculator, Exchange Profile Analyzer, Loadgen, and Jetstress to take the guesswork out of server sizing.
hbaseconasia2019 Phoenix Improvements and Practices on Cloud HBase at Alibaba (Michael Stack)
Yun Zhang
Track 2: Ecology and Solutions
https://open.mi.com/conference/hbasecon-asia-2019
THE COMMUNITY EVENT FOR APACHE HBASE™
July 20th, 2019 - Sheraton Hotel, Beijing, China
https://hbase.apache.org/hbaseconasia-2019/
In-Memory and TimeSeries Technology to Accelerate NoSQL Analytics (sandor szabo)
The ability of Informix to combine the in-memory performance of Informix Warehouse Accelerator with the flexibility of TimeSeries and NoSQL analytics positions it to be ready for the IoT era.
[ACNA2022] Hadoop Vectored IO: your data just got faster! (MukundThakur22)
Since 2006 the world of big data has moved from terabytes to hundreds of petabytes, from local clusters to remote cloud storage, yet the original Apache Hadoop POSIX-based file APIs have barely changed.
It is wonderful that these APIs have worked so well, but we can do a lot better with remote object stores, by providing new operations which suit them better, targeted at columnar data libraries such as ORC and Spark. Only a few libraries need to migrate to these APIs for significant speedups of all big data applications.
This talk introduces a new Hadoop Filesystem API called "vectored read", coming in Hadoop 3.4. An extension of the classic FSDataInputStream, it is automatically offered by all filesystem clients.
The S3A connector is the first object store to provide a custom implementation, reading different blocks of data in parallel. In Apache Hive benchmarks with a modified ORC library, we saw a 2x speedup compared to using the classic s3a connector through the Posix APIs.
We will introduce the API spec, the S3A implementation, and the benchmarks, and show how to use it in your own applications. We will also cover our ongoing work on providing similar speedups with other object stores, and the use of the API in other applications.
Leveraging Open Source to Manage SAN Performance (brettallison)
Scope - The primary focus of this presentation is how to leverage open source software to help in managing Shared Storage performance. The storage server will be the focus with particular emphasis on ESS. This solution is a small one-off solution.
Five major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster
1. Five major tips to maximize performance on a 200+ SQL HBase/Phoenix cluster
Masayasu “Mas” Suzuki
Shinji Nagasaka
Takanari Tamesue
Sony Corporation
2. 2
Who we are, and why we chose HBase/Phoenix
We are DevOps members from Sony’s News Suite team
– http://socialife.sony.net/
HBase/Phoenix was chosen because of
– Scalability,
– SQL compatibility, and
– secondary indexing support
3. 3
Our use case
[Diagram] Sony News Suite Server Architecture: end users and outside content providers connect over HTTP (the Internet) to the Application Server, whose EventHandler and Fetcher components issue SQL (READ) and SQL (WRITE) statements against Phoenix on HBase
Main use case is caching contents temporarily
4. 4
Basic test design
Query response time is measured as shown in red in the original slide's diagram
Query read/write ratio is 6 to 1
12 different types of queries using eight separate indexes
[Diagram] Application Server (EventHandler / Fetcher) issuing SQL (READ) and SQL (WRITE) to Phoenix on HBase
5. 5
Table schema
A table with 1.2 billion records was created
Each record is around 1.0 Kbytes
– Raw data is around 1.7 KBytes each
– Gzip is used to compress column pt and hence the total comes out to be around 1.0 Kbytes
id is the primary key
– Two MD5 hashed values are concatenated to create id
• Example: df461a2bda4002aaaa8117d4e43ee737_cfcd208495d565ef66e7dff9f98764da
Columns: id CHAR(65), ai VARCHAR, ao VARCHAR, b DECIMAL, c DECIMAL, cl CHAR(5), lg CHAR(2), lw DECIMAL, u DECIMAL, pt VARBINARY
Sample rows (1.2 billion records in total):
1adf… | TR | DSATE... | 82122... | 9071.9 | true | es | 823.199 | 0.1243 | (binary)
9d0a… | FB | Adad... | 54011… | 122114.5 | true | ja | 23.632 | 5.22 | (binary)
c5ae... | KW | 4 of … | 20011… | 3253.55 | false | fr | 0.343 | 2.77 | (binary)
ea4a... | AB | p7mj… | 67691… | 8901.0 | true | en | 76.21 | 23.11 | (binary)
6. 6
Split points
Because it was impossible to store all 1.2 billion records on one single node, we manually split the tables by defining the split points
Split points were set so that each divided block, or region file, would be nearly equal in size
– This was possible because we knew
a. the exact range of our primary keys, and
b. the hashed values of our primary keys would be uniformly distributed
CREATE TABLE IF NOT EXISTS TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC (
id CHAR(65) NOT NULL,
ai VARCHAR, ao VARCHAR,
b DECIMAL, c DECIMAL,
cl CHAR(5), lg CHAR(2),
lw DECIMAL, u DECIMAL,
p_t VARBINARY,
CONSTRAINT my_pk PRIMARY KEY ( id )
) COMPRESSION='LZ4', VERSIONS='1', MAX_FILESIZE=26843545600 SPLIT ON
( '0148','0290','03d8','0520','0668','07b0','08f8','0a40', …,'fef8' );
7. 7
Distribution of region file per RegionServer
If split points can be evenly set, then data allocation can be evened out
[Chart] Total size per node across the 200 RegionServers; different colors denote different tables
8. 8
Queries
Ratio of R/W queries is 6 to 1
Sample READ queries:
SELECT id FROM TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC WHERE b=228343239 AND cl='false';
SELECT id FROM TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC WHERE ai='AB' AND cl='false' AND c>0 AND c<1417648603068;
Sample WRITE query:
/* Written as a Java PreparedStatement */
UPSERT INTO TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC (id,p_t,c,lw,u) VALUES (?,?,?,?,?)
Constants (ex. 228343239, the value of b in the first query above) were randomly generated to simulate the current production environment
9. 9
Queries – Details
Query No. | Name | Read/Write | Percentage generated | Description | Randomly generated part
1 | Id | READ | 25% | Search using primary key | Id (primary key)
2 | IdCnt | READ | 10% | Count using primary key | Id (primary key)
3 | IdOr | READ | 10% | Search using “OR” of ten primary keys | Id (primary key)
4 | AiAoU | READ | 5% | Search using columns Ai, Ao, and U | Ai, Ao, U
5 | AiCCl | READ | 5% | Search using columns Ai, C, and Cl | Ai, C, Cl
6 | AiLwCl | READ | 5% | Search using columns Ai, Lw, and Cl | Ai, Lw, Cl
7 | AiULg | READ | 5% | Search using columns Ai, U, and Lg | Ai, U, Lg
8 | BCl | READ | 5% | Search using columns B and Cl | B, Cl
9 | BLg | READ | 5% | Search using columns B and Lg | B, Lg
10 | CLg | READ | 5% | Search using columns C and Lg | C, Lg
11 | LwLg | READ | 5% | Search using columns Lw and Lg | Lw, Lg
12 | PtCLwU | WRITE | 15% | Upsert binary data Pt and upsert columns C, Lw, and U | Id (primary key), Pt, C, Lw, U
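The IdCnt and IdOr shapes above are not shown on the sample-queries slide; as a minimal sketch (our illustration, not the deck's actual statements), they could look like the following, reusing the example key from the table-schema slide as a placeholder:
/* IdCnt: count using the primary key (sketch; the key literal is the deck's example value, not production data) */
SELECT COUNT(*) FROM TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC
WHERE id='df461a2bda4002aaaa8117d4e43ee737_cfcd208495d565ef66e7dff9f98764da';
/* IdOr: search using an "OR" of ten primary keys (only two placeholder keys shown here for brevity) */
SELECT id FROM TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC
WHERE id='df461a2bda4002aaaa8117d4e43ee737_cfcd208495d565ef66e7dff9f98764da'
   OR id='df461a2bda4002aaaa8117d4e43ee737_cfcd208495d565ef66e7dff9f98764db';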
10. 10
Secondary indexes
Following eight indexes were created
Eight indexes are designed to be orthogonal indexes
Split points were manually set for index tables so that each region file would be similar in size
Index No. | Name | Index type | Description
1 | AiAoU | CHAR/CHAR/DECIMAL | For use in search using columns Ai, Ao, and U
2 | AiCCl | CHAR/DECIMAL/CHAR | For use in search using columns Ai, C, and Cl
3 | AiLwCl | CHAR/DECIMAL/CHAR | For use in search using columns Ai, Lw, and Cl
4 | AiULg | CHAR/DECIMAL/CHAR | For use in search using columns Ai, U, and Lg
5 | BCl | DECIMAL/CHAR | For use in search using columns B and Cl
6 | BLg | DECIMAL/CHAR | For use in search using columns B and Lg
7 | CLg | DECIMAL/CHAR | For use in search using columns C and Lg
8 | LwLg | DECIMAL/CHAR | For use in search using columns Lw and Lg
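The index DDL itself is not shown in the deck; as a minimal sketch (our assumption of what such a statement could look like, not the team's actual DDL), a Phoenix secondary index with manually chosen split points for the AiCCl case might be written as:
/* Sketch only: the index name and split-point literals are illustrative */
CREATE INDEX IF NOT EXISTS AICCL
ON TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC (ai, c, cl)
SPLIT ON ('AB', 'FB', 'KW', 'TR');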
11. 11
Test environment
[Diagram] Test environment:
100 clients (100 x c4.xlarge)
3 Zookeepers (3 x m3.xlarge)
3 HMasters (3 x m3.xlarge): Main, Secondary, and Secondary Backup
200 RegionServers (199 x r3.xlarge, 1 x c4.8xlarge), each with its own disk; one RegionServer hosts SYSTEM.CATALOG (metadata for the Phoenix plug-in)
12. 12
Tools used
Tools were especially useful for
– Pinpointing the bottlenecks in resource usage
– Determining when and where an error occurred within the cluster
– Verifying the effect of solutions applied
– Managing multiple nodes seamlessly without having to manage them separately
Tools used | Purpose
(tool name not captured in the transcript) | Analysis of resource usage per AWS instance (ex. CPU usage, network traffic, disk utilization, Java stats)
(tool name not captured in the transcript) | Analysis of status of HBase and Hadoop layers (ex. number of regions, store files, requests)
(tool name not captured in the transcript) | Analysis of distribution of each HBase table over the cluster (ex. number and size of region files per node)
Fabric | Remotely control multiple nodes via SSH
13. 13
Performance test apparatus & results
Test apparatus
Test results
Specs
Number of records 1.2 billion records (1KB each)
Number of indexes 8 orthogonal indexes
Servers
3 Zookeepers (Zookeeper 3.4.5, m3.xlarge x 3)
3 HMaster servers (hadoop 2.5.0, hbase 0.98.6, Phoenix 4.3.0, m3.xlarge x 3)
200 RegionServers
(hadoop 2.5.0, hbase 0.98.6, Phoenix 4.3.0, r3.xlarge x 199, c4.8xlarge x 1)
Clients 100 x c4.xlarge
Results
Number of queries 51,053 queries/sec
Response time (average) 46 ms
14. 14
Cost
Total: $325,236 (per year, “All Upfront” pricing)
This is a preliminary setup!
– There is room for further spec/cost optimization
Node Type | Instance Type | Quantity | Cost (per year)
HBase:ZooKeeper | m3.xlarge | 3 | $4,284
Hadoop:Name Node + HBase:HMaster | m3.xlarge | 3 | $4,284
Hadoop:Data Node + HBase:RegionServer | r3.xlarge | 199 | $307,455
HBase:RegionServer (for housing meta table SYSTEM.CATALOG) | c4.8xlarge | 1 | $9,213
15. 15
Five major tips to maximize performance using HBase/Phoenix
Ordered by effectiveness
16. 16
Tips 1 – Use SQL hint clause when using an index
[Charts] Response without hint clause vs. response with hint clause: response time [ms] over elapsed time [hours] for the twelve queries (Id, IdCnt, IdOr, AiAoU, AiCCl, AiLwCl, AiULg, BCl, BLg, CLg, LwLg, PtCLwU), grouped into queries using the primary key, queries using an index, and the write query
Performance improved by 6 times
17. 17
Tips 1 – Use SQL hint clause when using an index
Major possible cause (yet to be verified)
– When the index is used, an extra RPC is issued to verify latest meta/statistics
– Using hint clause may reduce this RPC (still hypothesis)
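The deck does not show the hint syntax itself; as a minimal sketch (assuming the standard Phoenix INDEX hint, not necessarily the exact statements the team ran), the BCl read query from the sample-queries slide could name its index explicitly:
/* Sketch: ask Phoenix to use the BCl secondary index for this query */
SELECT /*+ INDEX(TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC BCL) */ id
FROM TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC
WHERE b=228343239 AND cl='false';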
Other possible solutions
– Changing “UPDATE_CACHE_FREQUENCY” (available from Phoenix 4.7) may resolve this issue (we have not tried this yet; see the sketch after the quote below)
From the Phoenix website …
https://phoenix.apache.org/#Altering
“When a SQL statement is run which references a table, Phoenix will by default check with the server to ensure it has the most up to date table metadata and statistics. This RPC may not be necessary when you know in advance that the structure of a table may never change.”
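For completeness, a minimal sketch of the UPDATE_CACHE_FREQUENCY alternative quoted above (untested by the team, per the slide; the 900000 ms value is an arbitrary illustration):
/* Sketch: re-check table metadata/statistics at most every 15 minutes instead of on every statement (Phoenix 4.7+) */
ALTER TABLE TBL_1200M_IDX_LZ4_VER1_SPLT200_PTBIN_INT2DEC SET UPDATE_CACHE_FREQUENCY = 900000;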
18. 18
Tips 2 – Use memories aggressively
In early stages of our testing, disk utilization and iowait of RegionServers were extremely high
[Charts] Disk utilization and iowait over the test period
19. 19
Tips 2 – Use memories aggressively
Issue was most critical during major compaction and index creation
Initially, we thought we had enough memory
– Total size of data (includes all tables/indexes and mirrored data in Hadoop layer)
• More than 1,360 GB
– Total available memory combined on RegionServers (then)
• Around 1,500 GB (m3.2xlarge(30GiB) x 50 nodes)
But this left very little margin for computation-intensive tasks
We decided to allocate memory at least 3 times the size of data for added protection and performance (this has worked thus far)
20. 20
Tips 3 – Manually split the region file but don’t over split them
A single table is too big to be placed and managed by one single node
We wanted to know whether we should split in a finer way or in a coarser way
21. 21
Tips 3 – Manually split the region file but don’t over split them
Comparison between 200 and 4002 split points
– 200 RegionServers were used in both cases
Don’t over split region files
[Charts] Volume processed [queries/sec] and response time [ms] over elapsed time [H], SplitPoint = 200 vs. SplitPoint = 4002
22. 22
Tips 4 – Scale-out instead of scale-up
Comparison of RegionServers running c3.4xlarge and c3.8xlarge
– c3.8xlarge is twice the spec of c3.4xlarge
– Combined computing power of “100 nodes of c3.4xlarge” is equal to “50 nodes of c3.8xlarge”, but the former scores better
[Charts] Volume processed [queries/sec] and response time [ms] over elapsed time [H], c3.4xlarge x 100 vs. c3.8xlarge x 50
Scale-out!
23. 23
Tips 5 – Avoid running power-intensive tasks simultaneously
For example, do not run major compaction together with index creation
Also, the performance impact from major compaction can be lessened by running compactions in smaller units
[Charts] Major compaction for nine tables done simultaneously (26,142 queries/sec, 91 ms) vs. done separately (29,980 queries/sec, 80 ms): a 13% increase in volume processed and 9% faster responses
25. 25
First and foremost
Please understand that these are lessons learned through our tests on our environment
Any one or all of these items may prove useful in your environment
26. 26
Items of limited success – Changing GC algorithm
RegionServers’ GC algorithm was changed and tested
Performance is more even with G1
Performance of G1 is, on average, 2% lower than CMS
[Charts] Volume processed [queries/sec] and response time [ms] over elapsed time [H], CMS vs. G1
27. 27
Items of limited success – Changing Java heap size
RegionServers’ Java heap size was changed and tested
Maximum physical memory is 30.5 GiB (r3.xlarge)
When the heap was set to 26.0 GB, the system crashed after five hours
[Charts] Volume processed [queries/sec] and response time [ms] over elapsed time [H], JavaHeap = 20.5 GB vs. 23.0 GB vs. 26.0 GB
28. 28
Items of limited success – Changing disk file format
RegionServers’ disk file format was changed and tested
The newer xfs tends to score slightly better when compared at its highs
[Charts] Volume processed [queries/sec] and response time [ms] over elapsed time [H], ext4 vs. xfs
30. 30
Five major tips to maximize performance on HBase/Phoenix
Ordered by effectiveness (Most effective on the very top)
Tips 1. Use SQL hint clause when using a secondary index
– An extra RPC is issued when the client runs a SQL statement that uses a secondary index
– Using a SQL hint clause can mitigate this
– From Ver. 4.7, changing “UPDATE_CACHE_FREQUENCY” may also work (we have yet to test this)
Tips 2. Use memories aggressively
– A memory-rich node should be selected for use in RegionServers so as to minimize disk access
Tips 3. Manually split the region file if you can but never over split them
Tips 4. Scale-out instead of scale-up
– More nodes running in parallel yield better results than fewer but more powerful nodes
Tips 5. Avoid running power-intensive tasks simultaneously
– As an example, running major compaction and index creation simultaneously should be avoided
31. 31
Special Thanks
Takafumi Suzuki
– Thank you very much for the countless and invaluable discussions
– We owe the success of this project to you!
Thank you very much!
32. “Sony” is a registered trademark of Sony Corporation.
Names of Sony products and services are the registered trademarks and/or trademarks of Sony Corporation or its Group companies.
Other company names and product names are the registered trademarks and/or trademarks of the respective companies.