HBase Technical Introduction. This deck covers the memory design, the write path, the read path, some operational tidbits, SQL on HBase (Phoenix and Hive), and HOYA (HBase on YARN).
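The write path the deck describes follows a fixed sequence: append to the write-ahead log, buffer in the in-memory MemStore, flush to an immutable HFile when the buffer fills. As a conceptual sketch only (not the real HBase implementation), the flow can be caricatured like this:

```python
class MiniRegionWriter:
    """Conceptual sketch of the HBase write path: every put is appended
    to the write-ahead log (WAL) first, then buffered in the in-memory
    MemStore, which is flushed to an immutable HFile when it fills."""

    def __init__(self, flush_threshold=2):
        self.wal = []          # durability: replayed after a crash
        self.memstore = {}     # in-memory buffer of recent writes
        self.hfiles = []       # immutable "on-disk" files produced by flushes
        self.flush_threshold = flush_threshold

    def put(self, row, value):
        self.wal.append((row, value))   # 1. log for durability
        self.memstore[row] = value      # 2. buffer in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the sorted buffer out as one immutable HFile
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
```

Reads then have to merge the MemStore with all flushed HFiles, which is why the deck pairs the write path with compactions and the read path.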
Building a geospatial processing pipeline using Hadoop and HBase and how Mons... (DataWorks Summit)
Monsanto built a geospatial platform on Hadoop and HBase capable of managing over 120 billion polygons. As a result of the extreme data volumes and compute complexity, we were forced to migrate our data processing from a traditional RDBMS to a scale-out Hadoop implementation. Data processing that took over 30 days on 8% of the data now runs in under 12 hours on the entire data set. Very little concrete material exists on how to process spatial data via MapReduce or model it in HBase. We will provide concrete and novel examples for processing and storing spatial data on Hadoop and HBase. As part of the data processing pipeline we integrated the popular open-source geospatial processing library GDAL with MapReduce to convert all geospatial datasets to a common format and projection. We developed a method for splitting and processing images via MapReduce in which the boundaries of splits needed to be shared by multiple tasks due to the nature of the computation being performed on the data. Bulk writes to HBase were performed by writing HFiles directly. Finally, we developed a novel method for storing geospatial data in HBase that met the needs of our access pattern.
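The abstract does not publish its exact row-key scheme, but a common way to keep spatially close geometries adjacent in HBase's sorted key space is a space-filling curve such as Z-order. A minimal, hypothetical sketch (the 16-bit quantization is an arbitrary choice for the example):

```python
def z_order_key(lat, lon, bits=16):
    """Interleave quantized lat/lon bits into a single Z-order row key.

    Nearby coordinates share key prefixes, so an HBase range scan over
    a key prefix touches a spatially compact set of rows.
    """
    # Quantize each coordinate into an unsigned integer of `bits` bits.
    qlat = int((lat + 90.0) / 180.0 * ((1 << bits) - 1))
    qlon = int((lon + 180.0) / 360.0 * ((1 << bits) - 1))
    key = 0
    for i in range(bits):  # interleave: lon bit at odd positions, lat at even
        key |= ((qlon >> i) & 1) << (2 * i + 1)
        key |= ((qlat >> i) & 1) << (2 * i)
    return format(key, '0{}x'.format(bits // 2))  # fixed-width hex row key
```

With keys like these, polygons near each other on the map land in the same region, which also makes the HFile bulk-load path mentioned above straightforward: sort by key, write HFiles, hand them to HBase.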
Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. It was initially developed by Facebook.
WANdisco is a provider of non-stop software for global enterprises to meet the challenges of Big Data and distributed software development.
KEY HIGHLIGHTS, Session 1: Tuesday, Feb. 26, 5:15 p.m.-6 p.m.
Hadoop and HBase on the Cloud: A Case Study on Performance and Isolation
Cloud infrastructure is a flexible tool for orchestrating multiple Hadoop and HBase clusters, providing strict isolation of data and compute resources for multiple customers. Most importantly, our benchmarks show that a virtualized environment allows for higher average utilization of per-node resources. For more session information, visit http://na.apachecon.com/schedule/presentation/131/.
CO-PRESENTERS, Dr. Konstantin V. Shvachko, Chief Architect, Big Data, WANdisco and Jagane Sundar, CTO/VP Engineering, Big Data, WANdisco
A veteran Hadoop developer and respected author, Konstantin Shvachko is a technical expert specializing in efficient data structures and algorithms for large-scale distributed storage systems. Konstantin joined WANdisco through the AltoStor acquisition and before that he was founder and Chief Scientist at AltoScale, a Hadoop and HBase-as-a-Platform company acquired by VertiCloud. Konstantin played a lead architectural role at eBay, building two generations of the organization's Hadoop platform. At Yahoo!, he worked on the Hadoop Distributed File System (HDFS). Konstantin has dozens of publications and presentations to his credit and is currently a member of the Apache Hadoop PMC. Konstantin has a Ph.D. in Computer Science and M.S. in Mathematics from Moscow State University, Russia.
Jagane Sundar has extensive big data, cloud, virtualization, and networking experience and joined WANdisco through its AltoStor acquisition. Before AltoStor, Jagane was founder and CEO of AltoScale, a Hadoop and HBase-as-a-Platform company acquired by VertiCloud. His experience with Hadoop began as Director of Hadoop Performance and Operability at Yahoo! Jagane has such accomplishments to his credit as the creation of Livebackup, development of a user mode TCP Stack for Precision I/O, development of the NFS and PPP clients and parts of the TCP stack for JavaOS for Sun MicroSystems, and more. Jagane received his B.E. in Electronics and Communications Engineering from Anna University.
About WANdisco
WANdisco ( LSE : WAND ) is a provider of enterprise-ready, non-stop software solutions that enable globally distributed organizations to meet today's data challenges of secure storage, scalability and availability. WANdisco's products are differentiated by the company's patented, active-active data replication technology, serving crucial high availability (HA) requirements, including Hadoop Big Data and Application Lifecycle Management (ALM). Fortune Global 1000 companies including AT&T, Motorola, Intel and Halliburton rely on WANdisco for performance, reliability, security and availability. For additional information, please visit www.wandisco.com.
Introduction to Big Data & Hadoop Architecture - Module 1 (Rohit Agrawal)
Learning Objectives - In this module, you will understand what Big Data is, the limitations of existing solutions to the Big Data problem, how Hadoop solves it, the common Hadoop ecosystem components, the Hadoop architecture, HDFS and the MapReduce framework, and the anatomy of file writes and reads.
The Apache⢠HadoopŽ project develops open-source software for reliable, scalable, distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failures.
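Production Hadoop jobs are normally written in Java against the org.apache.hadoop.mapreduce API, but the "simple programming model" itself fits in a few lines. This pure-Python sketch mimics the map, shuffle, and reduce phases of the canonical word count:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in one input line.
    return [(w.lower(), 1) for w in line.split()]

def reduce_phase(word, counts):
    # Reducer: aggregate all the counts that were emitted for one key.
    return (word, sum(counts))

def run_job(lines):
    # Shuffle: group intermediate pairs by key, as the framework would
    # between the map and reduce phases.
    groups = defaultdict(list)
    for line in lines:
        for word, one in map_phase(line):
            groups[word].append(one)
    return dict(reduce_phase(w, c) for w, c in groups.items())
```

The framework's value is everything around these two functions: splitting the input across machines, re-running failed tasks, and moving intermediate data, which is exactly the application-layer failure handling described above.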
Generic presentation about Big Data Architecture/Components. This presentation was delivered by David Pilato and Tugdual Grall during JUG Summer Camp 2015 in La Rochelle, France
Presentation on 2013-06-27 at the Workshop on the Future of Big Data Management, discussing Hadoop for a science audience who are either HPC/grid users or people suddenly discovering that their data is accruing toward petabytes.
The other talks were on GPFS, LustreFS and Ceph, so rather than just do beauty-contest slides, I decided to raise the question of "what is a filesystem?" and whether the constraints imposed by the Unix metaphor and API are becoming limits on scale and parallelism (both technically and, for GPFS and Lustre Enterprise, in cost).
Then: HDFS as the foundation for the Hadoop stack.
All the other FS talks did emphasise their Hadoop integration, with the Intel talk doing the most to assert performance improvements of LustreFS over HDFSv1 in dfsIO and Terasort (no gridmix?), which showed something important: Hadoop is the application that all DFS developers have to have a story for.
More about Hadoop
www.beinghadoop.com
https://www.facebook.com/hadoopinfo
This PPT gives information about:
the complete Hadoop architecture
how a user request is processed in Hadoop
the NameNode
the DataNode
the JobTracker
the TaskTracker
post-installation Hadoop configuration
Here's the second version of our big data landscape. Thoughts, questions, comments? We'd love to hear your feedback in the comments section here: http://wp.me/p2dLS7-6A
Chicago Data Summit: Apache HBase: An Introduction (Cloudera, Inc.)
Apache HBase is an open source distributed data-store capable of managing billions of rows of semi-structured data across large clusters of commodity hardware. HBase provides real-time random read-write access as well as integration with Hadoop MapReduce, Hive, and Pig for batch analysis. In this talk, Todd will provide an introduction to the capabilities and characteristics of HBase, comparing and contrasting it with traditional database systems. He will also introduce its architecture and data model, and present some example use cases.
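The data model behind that description is, at its core, a sorted map from (row, column, timestamp) to value. This toy model (an illustration, not the HBase client API) captures how versioned gets and range scans fall out of that one sorted structure:

```python
class MiniHBaseTable:
    """Toy model of HBase's data model: a sorted map from
    (row, column, timestamp) to value. Real HBase persists this
    structure in HFiles and serves it from RegionServers."""

    def __init__(self):
        self.cells = {}

    def put(self, row, column, value, ts=0):
        # Negate the timestamp so the newest version sorts first.
        self.cells[(row, column, -ts)] = value

    def get(self, row, column):
        # Return the newest version stored for this row/column.
        for key in sorted(self.cells):
            if key[0] == row and key[1] == column:
                return self.cells[key]
        return None

    def scan(self, start_row, stop_row):
        # Range scan over the sorted row space, like HBase's Scan
        # (start inclusive, stop exclusive).
        return sorted({k[0] for k in self.cells
                       if start_row <= k[0] < stop_row})
```

Random reads and writes are cheap because everything is keyed lookups in a sorted structure; what this toy omits is the distribution of key ranges (regions) across servers, which is where the "large clusters of commodity hardware" part comes in.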
This report is intended primarily for business executives who are making important decisions with the results generated from data analysts and data scientists.
Social Networks and the Richness of Data (larsgeorge)
Social networks by their nature deal with large amounts of user-generated data that must be processed and presented in a time sensitive manner. Much more write intensive than previous generations of websites, social networks have been on the leading edge of non-relational persistence technology adoption. This talk presents how Germany's leading social networks Schuelervz, Studivz and Meinvz are incorporating Redis and Project Voldemort into their platform to run features like activity streams.
Hadoop is dead - long live Hadoop | BiDaTA 2013 Genoa (larsgeorge)
Keynote during BiDaTA 2013 in Genoa, a special track of the ADBIS 2013 conference. URL: http://dbdmg.polito.it/bidata2013/index.php/keynote-presentation
From Batch to Realtime with Hadoop - Berlin Buzzwords - June 2012 (larsgeorge)
In the early days of web applications, sites were designed to serve users and gather information along the way. With the proliferation of data sources and growing user bases, the amount of data generated required new ways of storage and processing. Hadoop's HDFS and its batch-oriented MapReduce opened new possibilities, yet it falls short of instant delivery of aggregate data to end users. Adding HBase and other layers, such as stream processing using Twitter's Storm, can overcome this delay and bridge the gap to realtime aggregation and reporting. This presentation takes the audience from the beginning of web application design to the current architecture, which combines multiple technologies to be able to process vast amounts of data, while still being able to react quickly and report near-realtime statistics.
http://berlinbuzzwords.de/sessions/batch-realtime-hadoop
HBase Applications - Atlanta HUG - May 2014 (larsgeorge)
HBase is good at various workloads, ranging from sequential range scans to purely random access. These access patterns can be translated into application types, usually falling into two major groups: entities and events. This presentation discusses the underlying implications and how to approach those use cases. Examples taken from Facebook show how this has been tackled in real life.
Where Do We Put It All? Lessons Learned Housing Large Geospatial Data Collect... (nacis_slides)
NACIS 2016 Presentation
Jo Ashley, OCUL Scholars Portal, University of Toronto Libraries
Amber Leahey, OCUL Scholars Portal, University of Toronto Libraries
The Ontario Council of University Libraries (OCUL) is a consortium of twenty-one university libraries in the province of Ontario, Canada that collaborates through collective purchasing and shared digital library infrastructure. OCUL's Scholars GeoPortal service (geo.scholarsportal.info) uses Esri software to provide a set of online tools for identifying, exploring, and downloading licensed geospatial datasets for academic research in Ontario. Since 2012, the usage and size of geospatial data collections housed and showcased in Scholars GeoPortal has grown significantly, with more than 220,000 site visits and over 140TB of data, resulting in a number of challenges. This session will introduce the GeoPortal's interface and discuss various data related issues and demands facing the current version of the geoportal, lessons learned, as well as future ideas and plans for continued success.
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos (Lester Martin)
A walk-thru of core Hadoop, the ecosystem tools, and Hortonworks Data Platform (HDP) followed by code examples in MapReduce (Java and C#), Pig, and Hive.
Presented at the Atlanta .NET User Group meeting in July 2014.
In this session you will learn:
What is Big Data?
What is Hadoop?
Overview of Hadoop Ecosystem
Hadoop Distributed File System or HDFS
Hadoop Cluster Modes
YARN
MapReduce
Hive
Pig
Zookeeper
Flume
Sqoop
For more information, visit: https://www.mindsmapped.com/courses/big-data-hadoop/hadoop-developer-training-a-step-by-step-tutorial/
More and more organizations are moving their ETL workloads to a Hadoop-based ELT grid architecture. Hadoop's inherent capabilities, especially its ability to do late binding, address some of the key challenges with traditional ETL platforms. In this presentation, attendees will learn the key factors, considerations, and lessons around ETL for Hadoop: pros and cons of different extract and load strategies, the best ways to batch data, buffering and compression considerations, leveraging HCatalog, data transformation, integration with existing data transformations, the advantages of different ways of exchanging data, and leveraging Hadoop as a data integration layer. This is an extremely popular presentation around ETL and Hadoop.
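Late binding means the schema is applied when data is read, not when it is loaded, so the same raw bytes can serve many schemas; in Hive this is typically an external table over raw files. A minimal Python sketch of the idea, with made-up sample data:

```python
import csv
import io

# Raw records landed as-is on the grid: no schema was enforced at load time.
RAW = "2024-01-02,clk,42\n2024-01-03,imp,7\n"

def read_with_schema(raw, schema):
    """Schema-on-read (late binding): the (name, cast) schema is applied
    only when the data is read, so reloading is never needed to
    reinterpret the same raw bytes under a different schema."""
    rows = []
    for rec in csv.reader(io.StringIO(raw)):
        rows.append({name: cast(val)
                     for (name, cast), val in zip(schema, rec)})
    return rows

# Two consumers, two schemas, one copy of the data.
events = read_with_schema(RAW, [("day", str), ("type", str), ("count", int)])
raw_view = read_with_schema(RAW, [("c1", str), ("c2", str), ("c3", str)])
```

This is the property that lets an ELT grid defer transformation: extract and load stay dumb and fast, and each downstream job binds its own structure.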
In this session you will learn:
1. History of Hadoop
2. Hadoop Ecosystem
3. Hadoop Animal Planet
4. What is Hadoop?
5. Distinctions of Hadoop
6. Hadoop Components
7. The Hadoop Distributed Filesystem
8. Design of HDFS
9. When Not to Use Hadoop?
10. HDFS Concepts
11. Anatomy of a File Read
12. Anatomy of a File Write
13. Replication & Rack Awareness
14. MapReduce Components
15. A Typical MapReduce Job
The data management industry has matured over the last three decades, primarily based on relational database management system (RDBMS) technology. As the amount of data collected and analyzed in enterprises has increased severalfold in the volume, variety, and velocity of its generation and consumption, organisations have started struggling with the architectural limitations of traditional RDBMS architecture. As a result, a new class of systems had to be designed and implemented, giving rise to the new phenomenon of "Big Data". In this paper we trace the origin of one such new class of system, Hadoop, built to handle Big Data.
2015 nov 27_thug_paytm_rt_ingest_brief_final (Adam Muise)
Paytm Labs provides a quick overview of their Hadoop data ingest platform. We cover our journey from a batch-focused ingest system with Sqoop to a streaming ingest supported by Kafka, Confluent.io, Hadoop, Cassandra, and Spark Streaming. This presentation also provides an overview of our complete data platform, including our feature-creation template.
Moving to a data-centric architecture: Toronto Data Unconference 2015 (Adam Muise)
Why use a datalake? Why use lambda? A conversation starter for Toronto Data Unconference 2015. We will discuss technologies such as Hadoop, Kafka, Spark Streaming, and Cassandra.
Creating a Data Science Team from an architect's perspective: how to build and support a data science team with the right staff, including data engineers and DevOps.
An overview of securing Hadoop. Content primarily by Balaji Ganesan, one of the leaders of the Apache Argus project. Presented on Sept 4, 2014 at the Toronto Hadoop User Group by Adam Muise.
2014 feb 24_big_datacongress_hadoopsession2_moderndataarchitecture (Adam Muise)
An introduction to Hadoop's core components as well as the core Hadoop use case: the Data Lake. This deck was delivered at Big Data Congress 2014 in Saint John, NB on Feb 24.
A brief "What is Hadoop?" intro for the Georgian Partners CTO Conference. It outlines the origins of open-source Apache Hadoop and how Hortonworks fits into this picture, along with a brief introduction to YARN, the new resource-negotiation layer.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Smart TV Buyer Insights Survey 2024 (91mobiles)
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients' needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What's changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Welcome to the first live UiPath Community Day Dubai! Join us for this unique occasion to meet our local and global UiPath Community and leaders. You will get a full view of the MEA region's automation landscape and the AI Powered automation technology capabilities of UiPath. Also, hosted by our local partners Marc Ellis, you will enjoy a half-day packed with industry insights and automation peers networking.
Curious about our agenda? Wait no more!
10:00 Welcome note - UiPath Community in Dubai
Lovely Sinha, UiPath Community Chapter Leader, UiPath MVPx3, Hyper-automation Consultant, First Abu Dhabi Bank
10:20 A UiPath cross-region MEA overview
Ashraf El Zarka, VP and Managing Director MEA, UiPath
10:35 Customer Success Journey
Deepthi Deepak, Head of Intelligent Automation CoE, First Abu Dhabi Bank
11:15 The UiPath approach to GenAI with our three principles: improve accuracy, supercharge productivity, and automate more
Boris Krumrey, Global VP, Automation Innovation, UiPath
12:15 Discover how Marc Ellis leverages tech-driven solutions in recruitment and managed services.
Brendan Lingam, Director of Sales and Business Development, Marc Ellis
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Enhancing Performance with Globus and the Science DMZ (Globus)
ESnet has led the way in helping national facilities, and many other institutions in the research community, configure Science DMZs and troubleshoot network issues to maximize data transfer performance. In this talk we will present a summary of approaches and tips for getting the most out of your network infrastructure using Globus Connect Server.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more "mechanical" approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
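One of the calisthenics constraints, "wrap all primitives and strings", translates directly into a tactical DDD value object. A small illustrative sketch (the class names are invented for the example):

```python
class Quantity:
    """Calisthenics rule 'wrap all primitives': a quantity is a
    first-class domain concept, not a bare int, so its invariants
    (non-negative, how it combines) live in exactly one place."""

    def __init__(self, amount):
        if amount < 0:
            raise ValueError("quantity cannot be negative")
        self.amount = amount

    def add(self, other):
        # Tell, don't ask: arithmetic stays inside the domain object
        # instead of leaking raw ints across the codebase.
        return Quantity(self.amount + other.amount)


class OrderLine:
    """An entity holding the value object, never a raw int."""

    def __init__(self, sku, quantity):
        self.sku = sku
        self.quantity = quantity  # always a Quantity
```

The constraint feels mechanical, yet it forces exactly the modeling decision DDD asks for: a place for the invariant to live.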
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days, 6 June 2024.
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Sustainability can be added to the quality characteristics and then measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Â
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
Â
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfPeter Spielvogel
Â
Building better applications for business users with SAP Fiori.
⢠What is SAP Fiori and why it matters to you
⢠How a better user experience drives measurable business benefits
⢠How to get started with SAP Fiori today
⢠How SAP Fiori elements accelerates application development
⢠How SAP Build Code includes SAP Fiori tools and other generative artificial intelligence capabilities
⢠How SAP Fiori paves the way for using AI in SAP apps
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Â
Sept 17 2013 - THUG - HBase a Technical Introduction
1. Deep Dive content by Hortonworks, Inc. is licensed under a
Creative Commons Attribution-ShareAlike 3.0 Unported License.
HBase
Technical Deep Dive
Sept 17 2013 – Toronto Hadoop User Group
Adam Muise
amuise@hortonworks.com
2. Deep Dive Agenda
• Background
  – (how did we get here?)
• High-level Architecture
  – (where are we?)
• Anatomy of a RegionServer
  – (how does this thing work?)
• Using HBase
  – (where do we go from here?)
Page 2
3. Background
Page 3
4. So what is a BigTable anyway?
• BigTable paper from Google, 2006, Chang et al.
  – "Bigtable is a sparse, distributed, persistent multi-dimensional sorted map."
  – http://research.google.com/archive/bigtable.html
• Key Features:
  – Distributed storage across a cluster of machines
  – Random, online read and write data access
  – Schemaless data model ("NoSQL")
  – Self-managed data partitions
Page 4
5. Modern Datasets Break Traditional Databases
Page 5
> 10x more always-connected mobile devices than seen in the PC era.
> Sensor, video, and other machine-generated data easily exceeds 100 TB/day.
> Traditional databases can't serve modern application needs.
6. Apache HBase: The Database For Big Data
Page 6
More data is the key to richer application experiences and deeper insights.
With HBase you can:
✓ Ingest and retain more data, to petabyte scale and beyond.
✓ Store and access huge data volumes with low latency.
✓ Store data of any structure.
✓ Use the entire Hadoop ecosystem to gain deep insight on your data.
7. HBase At A Glance
Page 7
(diagram: CLIENT LAYER on top of HBASE LAYER on top of HDFS LAYER, with four callouts)
1. Clients automatically load balanced across the cluster.
2. Scales linearly to handle any load.
3. Data stored in HDFS allows automated failover.
4. Analyze data with any Hadoop tool.
8. HBase: Real-Time Data on Hadoop
Page 8
> Read, Write, Process and Query data in real time using Hadoop infrastructure.
9. HBase: High Availability
Page 9
> Data safely protected in HDFS.
> Failed nodes are automatically recovered.
> No single point of failure, no manual intervention.
(diagram: two HBase nodes replicating data into HDFS)
10. HBase: Multi-Datacenter Replication
Page 10
> Replicate data to 2 or more datacenters.
> Load balancing or disaster recovery.
11. HBase: Seamless Hadoop Integration
Page 11
> HBase makes deep analytics simple using any Hadoop tool.
> Query with Hive, process with Pig, classify with Mahout.
12. Apache Hadoop in Review
• Apache Hadoop Distributed Filesystem (HDFS)
  – Distributed, fault-tolerant, throughput-optimized data storage
  – Uses a filesystem analogy, not structured tables
  – The Google File System, 2003, Ghemawat et al.
  – http://research.google.com/archive/gfs.html
• Apache Hadoop MapReduce (MR)
  – Distributed, fault-tolerant, batch-oriented data processing
  – Line- or record-oriented processing of the entire dataset
  – "[Application] schema on read"
  – MapReduce: Simplified Data Processing on Large Clusters, 2004, Dean and Ghemawat
  – http://research.google.com/archive/mapreduce.html
Page 12
For more on writing MapReduce applications, see "MapReduce Patterns, Algorithms, and Use Cases": http://highlyscalable.wordpress.com/2012/02/01/mapreduce-patterns/
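The record-oriented map/reduce flow described above can be sketched with a toy, in-memory Python word count (everything here is illustrative; real jobs run distributed over HDFS splits):

```python
from itertools import groupby

def map_phase(records, mapper):
    # Run the mapper over every input record, yielding (key, value) pairs.
    return [kv for rec in records for kv in mapper(rec)]

def reduce_phase(pairs, reducer):
    # "Shuffle": sort by key so equal keys are adjacent, then group and reduce.
    pairs.sort(key=lambda kv: kv[0])
    return {k: reducer(k, [v for _, v in grp])
            for k, grp in groupby(pairs, key=lambda kv: kv[0])}

# Classic word count over line-oriented input.
lines = ["big table", "big data", "data data"]
counts = reduce_phase(
    map_phase(lines, lambda line: [(w, 1) for w in line.split()]),
    lambda word, ones: sum(ones))
# counts == {"big": 2, "data": 3, "table": 1}
```

The sort-then-group step is the part the framework does for you; mappers and reducers are the only application code.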
13. High-level Architecture
Page 13
14. Logical Architecture
• [Big]Tables consist of billions of rows, millions of columns
• Records ordered by rowkey
  – Inserts require sort, write-side overhead
  – Applications can take advantage of the sort
• Continuous sequences of rows partitioned into Regions
  – Regions partitioned at row boundaries, according to size (bytes)
• Regions automatically split when they grow too large
• Regions automatically distributed around the cluster
  – "Hands-free" partition management (mostly)
Page 14
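The rowkey ordering and region partitioning above can be sketched in a few lines of Python (the start keys and the `bisect` routing are an invented toy, not HBase's client code):

```python
import bisect

# Toy model of rowkey-ordered regions. A region covers
# [its start key, next region's start key); keys are invented examples.
region_start_keys = ["", "e", "i", "m"]   # Regions 1-4 of a table

def find_region(rowkey):
    # Rightmost region whose start key is <= rowkey; the rowkey sort
    # order is what makes this routing a simple binary search.
    return bisect.bisect_right(region_start_keys, rowkey) - 1

def split_region(midpoint_key):
    # When a region grows too large it splits at a row boundary;
    # the daughter region simply gets its own start key.
    bisect.insort(region_start_keys, midpoint_key)

assert find_region("cat") == 0        # "cat" sorts into ["", "e")
assert find_region("zebra") == 3      # last region
split_region("k")                     # region ["i", "m") grew too big
assert find_region("kite") == 3       # now lands in the new ["k", "m") region
```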
15. Logical Architecture
Distributed, persistent partitions of a BigTable
(diagram: rows a–p of Table A split into Regions 1–4; Region Server 7 hosts Table A Regions 1 and 2 plus regions of Tables G and L; Region Server 86 hosts Table A Region 3 plus regions of Tables C and F; Region Server 367 hosts Table A Region 4 plus regions of Tables C, E, and P)
Legend:
- A single table is partitioned into Regions of roughly equal size.
- Regions are assigned to Region Servers across the cluster.
- Region Servers host roughly the same number of regions.
Page 15
16. Physical Architecture
• RegionServers collocate with DataNodes
  – Tight MapReduce integration
  – Opportunity for data-local online processing via coprocessors (experimental)
• HBase Master process manages Region assignment
• ZooKeeper is the configuration glue
• Clients communicate directly with RegionServers (data path)
  – Horizontally scales client load
  – Significantly harder for a single ignorant process to DoS the cluster
• For DDL operations, clients communicate with the HBase Master
• No persistent state in Master or ZooKeeper
  – Recover from HDFS snapshot
  – See also: AWS Elastic MapReduce's HBase restore path
Page 16
17. Physical Architecture
Distribution and Data Path
Page 17
(diagram: a ZooKeeper ensemble; HBase clients (Java apps, the HBase Shell, and a REST/Thrift gateway) communicate directly with RegionServer/DataNode pairs; the HBase Master pairs with the NameNode)
Legend:
- An HBase RegionServer is collocated with an HDFS DataNode.
- HBase clients communicate directly with Region Servers for sending and receiving data.
- HMaster manages Region assignment and handles DDL operations.
- Online configuration state is maintained in ZooKeeper.
- HMaster and ZooKeeper are NOT involved in the data path.
18. Logical Data Model
• Table as a sorted map of maps
  {rowkey => {family => {qualifier => {version => value}}}}
  – Think: nested OrderedDictionary (C#), TreeMap (Java)
• Basic data operations: GET, PUT, DELETE
• SCAN over ranges of key-values
  – The benefit of the sorted rowkey business
  – This is how you implement any kind of "complex query"
• GET, SCAN support Filters
  – Push application logic to RegionServers
• INCREMENT, CheckAnd{Put,Delete}
  – Server-side, atomic data operations
  – Require a read lock, can be contentious
• No: secondary indices, joins, multi-row transactions
Page 18
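The sorted map-of-maps model above can be mimicked with a small Python sketch (an in-memory toy with invented helper names, not the HBase client API; real HBase keeps KeyValues sorted on disk, here we sort on access):

```python
from collections import defaultdict

# {rowkey => {family => {qualifier => {timestamp => value}}}}
def new_table():
    return defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(table, rowkey, family, qualifier, value, ts):
    table[rowkey][family][qualifier][ts] = value

def get(table, rowkey, family, qualifier):
    versions = table[rowkey][family][qualifier]
    return versions[max(versions)] if versions else None   # newest version wins

def scan(table, start, stop):
    # SCAN is just a contiguous slice of the rowkey sort order.
    return [rk for rk in sorted(table) if start <= rk < stop]

t = new_table()
put(t, "a", "cf1", "bar", "hello", 1368394261)
put(t, "a", "cf1", "bar", 7,       1368394583)   # newer version shadows older
put(t, "b", "cf2", "thumb", b"<png bytes>", 1368387247)

assert get(t, "a", "cf1", "bar") == 7
assert scan(t, "a", "c") == ["a", "b"]
```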
19. Logical Data Model
A sparse, multi-dimensional, sorted map
Page 19
(diagram: Table A with rows "a" and "b"; row "a" carries families cf1 (qualifiers "foo" and "bar", each with timestamped versions such as 1368393847 "world", 1368394261 "hello", 1368394583 7) and cf2 (qualifiers "1.0001" and "2011-07-04"); row "b" carries cf2:"thumb" holding 3.6 kb of PNG data; callouts label rowkey, column family, column qualifier, timestamp, and value)
Legend:
- Rows are sorted by rowkey.
- Within a row, values are located by column family and qualifier.
- Values also carry a timestamp; there can be multiple versions of a value.
- Within a column family, data is schemaless. Qualifiers and values are treated as arbitrary bytes.
20. Anatomy of a RegionServer
Page 20
21. Storage Machinery
• RegionServers host N Regions, as assigned by the Master
  – Common case: Region data is local to the RegionServer/DataNode
• Each column family is stored in isolation from the others
  – "Column-family oriented" storage
  – NOT the same as column-oriented storage
• Key-values managed by an "HStore"
  – Combined view over data on disk + in-memory edits
  – A region manages one HStore for each column family
• On disk: key-values stored sorted in "StoreFiles"
  – StoreFiles are composed of an ordered sequence of "Blocks"
  – Each also carries a BloomFilter to minimize Block access
• In memory: a "MemStore" maintains a heap of recent edits
  – Not to be confused with the "BlockCache"
  – This structure is essentially a log-structured merge tree (LSM-tree)*, with the MemStore as C0 and the StoreFiles as C1
Page 21
* http://staff.ustc.edu.cn/~jpq/paper/flash/1996-The%20Log-Structured%20Merge-Tree%20%28LSM-Tree%29.pdf
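Since the storage machinery leans on BloomFilters to skip Blocks, here is a toy Bloom filter showing the no-false-negative property a read relies on (the class, sizes, and hashing scheme are invented for illustration, not HBase's HFile blooms):

```python
import hashlib

class TinyBloom:
    # Toy Bloom filter: a real per-StoreFile bloom lets a read skip files
    # that definitely do not contain the requested rowkey.
    def __init__(self, nbits=256, nhashes=3):
        self.nbits, self.nhashes, self.bits = nbits, nhashes, 0

    def _positions(self, key):
        # Derive nhashes bit positions from the key.
        for i in range(self.nhashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        # False means "definitely absent" (skip this StoreFile's blocks);
        # True means "maybe present" (the file must be read).
        return all(self.bits >> p & 1 for p in self._positions(key))

bloom = TinyBloom()
for rowkey in ("row-001", "row-002", "row-003"):
    bloom.add(rowkey)
assert bloom.might_contain("row-002")   # never a false negative
```

False positives are possible (an absent key may still answer True), which is why the filter can only skip reads, never replace them.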
22. Storage Machinery
Implementing the data model
Page 22
(diagram: a RegionServer holding one HLog (WAL), one BlockCache, and several HRegions; each HRegion contains one HStore per column family, each with a MemStore and StoreFiles backed by HFiles in HDFS)
Legend:
- A RegionServer contains a single WAL, single BlockCache, and multiple Regions.
- A Region contains multiple Stores, one for each Column Family.
- A Store consists of multiple StoreFiles and a MemStore.
- A StoreFile corresponds to a single HFile.
- HFiles and the WAL are persisted on HDFS.
23. Write Path (Storage Machinery cont.)
• Write summary:
  1. Log edit to HLog (WAL)
  2. Record in MemStore
  3. ACK write
• Data events are recorded to a WAL on HDFS, for durability
  – After failures, edits in the WAL are replayed during recovery
  – WAL appends are immediate, in the critical write path
• Data collects in the "MemStore" until a "flush" writes new HFiles
  – Flush is automatic, based on configuration (size, or staleness interval)
  – Flush clears WAL entries corresponding to MemStore entries
  – Flush is deferred, not in the critical write path
• HFiles are merge-sorted during "Compaction"
  – Small files compacted into larger files
  – Old records discarded (major compaction only)
  – Lots of disk and network IO
Page 23
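The three-step write summary above can be mimicked with a toy, in-memory sketch (the names, the entry-count flush trigger, and the single shared WAL list are simplifications invented here; real flushes trigger on bytes and the WAL lives on HDFS):

```python
wal, memstore, storefiles = [], {}, []
FLUSH_THRESHOLD = 3   # illustrative; real flushes are size-based

def put(rowkey, value):
    wal.append((rowkey, value))        # 1. durably log the edit first
    memstore[rowkey] = value           # 2. record in the in-memory store
    if len(memstore) >= FLUSH_THRESHOLD:
        flush()
    return "ACK"                       # 3. acknowledge the write

def flush():
    # Deferred, outside the critical write path: persist the sorted
    # key-values as a new "StoreFile" and clear the covered WAL entries.
    storefiles.append(sorted(memstore.items()))
    memstore.clear()
    wal.clear()

for k, v in [("r3", 30), ("r1", 10), ("r2", 20)]:
    put(k, v)
# The third put crossed the threshold: one sorted StoreFile was flushed.
```

Note that the StoreFile comes out rowkey-sorted even though the puts arrived out of order; that sort is what later makes merge-sorted compactions and scans cheap.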
24. Write Path
Storing a KeyValue
Page 24
(diagram: the RegionServer internals from the previous slide, with the four write steps numbered)
Legend:
1. A MutateRequest is received by the RegionServer.
2. A WALEdit is appended to the HLog.
3. The new KeyValues are written to the MemStore.
4. The RegionServer acknowledges the edit with a MutateResponse.
25. Read Path (Storage Machinery, cont.)
• Read summary:
  1. Evaluate query predicate
  2. Materialize results from Stores
  3. Batch results to client
• Scanners are opened over all relevant StoreFiles + the MemStore
  – The "BlockCache" maintains recently accessed Blocks in memory
  – BloomFilters are used to skip irrelevant Blocks
  – Predicate matches accumulate, are sorted, and return as ordered rows
• The same Scanner APIs are used for GET and SCAN
  – Different access patterns, different optimization strategies
  – SCAN:
    – HDFS is optimized for the throughput of long sequential reads
    – Consider a larger Block size for more data per seek
  – GET:
    – The BlockCache maintains hot Blocks for point access (GET)
    – Consider a more granular BloomFilter
Page 25
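The merged-scanner idea above can be sketched as a sort-and-deduplicate over the MemStore and StoreFiles (data and structures are invented for illustration; real scanners also consult the BlockCache and BloomFilters):

```python
from itertools import chain

memstore   = [("r2", 2, "new-b")]                     # (rowkey, timestamp, value)
storefiles = [[("r1", 1, "a"), ("r2", 1, "old-b")],   # each file is rowkey-sorted
              [("r3", 1, "c")]]

def scan(start, stop):
    results, seen = [], set()
    # Order by rowkey, then newest timestamp first, so the first hit per
    # rowkey is the version the reader should see.
    for rowkey, ts, value in sorted(chain(memstore, *storefiles),
                                    key=lambda kv: (kv[0], -kv[1])):
        if start <= rowkey < stop and rowkey not in seen:
            seen.add(rowkey)
            results.append((rowkey, value))
    return results

assert scan("r1", "r3") == [("r1", "a"), ("r2", "new-b")]
```

The in-memory edit for "r2" shadows the older on-disk version, which is exactly why a read must consult the MemStore and every relevant StoreFile.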
26. Read Path
Serving a single read request
Page 26
(diagram: the RegionServer internals, with the five read steps numbered)
Legend:
1. A GetRequest is received by the RegionServer.
2. StoreScanners are opened over the appropriate StoreFiles and the MemStore.
3. Blocks identified as potential matches are read from HDFS if not already in the BlockCache.
4. KeyValues are merged into the final set of Results.
5. A GetResponse containing the Results is returned to the client.
27. Using HBase
Page 27
28. For what kinds of workloads is it well suited?
• It depends on how you tune it, but…
• HBase is good for:
  – Large datasets
  – Sparse datasets
  – Loosely coupled (denormalized) records
  – Lots of concurrent clients
• Try to avoid:
  – Small datasets (unless you have *lots* of them)
  – Highly relational records
  – Schema designs requiring transactions
Page 28
29. HBase Use Cases
Page 29
(diagram: use cases plotted against four axes: Flexible Schema, Huge Data Volume, High Read Rate, High Write Rate)
Use cases shown: Machine-Generated Data, Distributed Messaging, Real-Time Analytics, Object Store, User Profile Management
30. HBase Example Use Case:
Major Hard Drive Manufacturer
Page 30
• Goal: detect defective drives before they leave the factory.
• Solution:
  – Stream sensor data to HBase as it is generated by their test battery.
  – Perform real-time analysis as data is added, and deep analytics offline.
• HBase is a perfect fit:
  – Scalable enough to accommodate all 250+ TB of data needed.
  – Seamless integration with Hadoop analytics tools.
• Result:
  – Went from processing only 5% of drive test data to 100%.
31. Other Example HBase Use Cases
• Facebook messaging and counts
• Time series data
• Exposing Machine Learning models (like risk sets)
• Large message set store and forward, especially in social media
• Geospatial indexing
• Indexing the Internet
Page 31
32. How does it integrate with my infrastructure?
• Horizontally scale application data
  – Highly concurrent read/write access
  – Consistent, persisted shared state
  – Distributed online data processing via Coprocessors (experimental)
• Gateway between online services and offline storage/analysis
  – Staging area to receive new data
  – Serve online "views" on datasets in HDFS
  – Glue between batch (HDFS, MR1) and online (CEP, Storm) systems
Page 32
33. What data semantics does it provide?
• GET, PUT, DELETE key-value operations
• SCAN for queries
• INCREMENT, CAS server-side atomic operations
• Row-level write atomicity
• MapReduce integration
Page 33
34. Creating a table in HBase
#!/bin/sh
# Small script to setup the hbase table used by OpenTSDB.

test -n "$HBASE_HOME" || {  #A
  echo >&2 'The environment variable HBASE_HOME must be set'
  exit 1
}
test -d "$HBASE_HOME" || {
  echo >&2 "No such directory: HBASE_HOME=$HBASE_HOME"
  exit 1
}

TSDB_TABLE=${TSDB_TABLE-'tsdb'}
UID_TABLE=${UID_TABLE-'tsdb-uid'}
COMPRESSION=${COMPRESSION-'LZO'}

exec "$HBASE_HOME/bin/hbase" shell <<EOF
create '$UID_TABLE',  #B
  {NAME => 'id', COMPRESSION => '$COMPRESSION'},  #B
  {NAME => 'name', COMPRESSION => '$COMPRESSION'}  #B

create '$TSDB_TABLE',  #C
  {NAME => 't', COMPRESSION => '$COMPRESSION'}  #C
EOF

#A From environment, not parameter
#B Make the tsdb-uid table with column families id and name
#C Make the tsdb table with the t column family

# Script taken from HBase in Action - Chapter 7
Page 34
35. Coprocessors in a nutshell
• Two types of coprocessors: Observers and Endpoints
• Coprocessors are Java code executed in each region server
• Observer
  – Similar to a database trigger
  – Available Observer types: RegionObserver, WALObserver, MasterObserver
  – Mainly used to extend pre/post logic around region server events, WAL events, or DDL events
• Endpoint
  – Sort of like a UDF
  – Extends the HBase client API to expose functions to a user
  – Still executed on the RegionServer
  – Often used for sums/aggregations (HBase packs in an aggregate example)
• BE VERY CAREFUL WITH COPROCESSORS
  – They run in your region servers and buggy code can take down your cluster
  – See the HOYA details to help mitigate risk
Page 35
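The Observer's trigger-like pre/post hooks can be sketched in Python (the hook names echo HBase's prePut/postPut, but the classes here are purely illustrative, not the coprocessor API):

```python
class AuditObserver:
    # Toy observer: logs every write and normalizes the value on the way in.
    def __init__(self):
        self.log = []

    def pre_put(self, rowkey, value):
        self.log.append(("pre", rowkey))
        return value.strip()              # an observer may rewrite the edit

    def post_put(self, rowkey, value):
        self.log.append(("post", rowkey))

class Region:
    # Stand-in for the region: observers wrap the actual write.
    def __init__(self, observers=()):
        self.store, self.observers = {}, list(observers)

    def put(self, rowkey, value):
        for ob in self.observers:
            value = ob.pre_put(rowkey, value)   # runs before the write
        self.store[rowkey] = value
        for ob in self.observers:
            ob.post_put(rowkey, value)          # runs after the write

audit = AuditObserver()
region = Region([audit])
region.put("r1", "  hello ")
assert region.store["r1"] == "hello"
assert audit.log == [("pre", "r1"), ("post", "r1")]
```

The same structure also shows the risk the slide warns about: an exception in a hook aborts the write for everyone, which is why buggy coprocessor code can take a RegionServer down.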
36. What about operational concerns?
• Balance memory and IO for reads
  – Contention between random and sequential access
  – Configure Block size and BlockCache based on access patterns
  – Additional resources:
    – "HBase: Performance Tuners," http://labs.ericsson.com/blog/hbase-performance-tuners
    – "Scanning in HBase," http://hadoop-hbase.blogspot.com/2012/01/scanning-in-hbase.html
• Balance IO for writes
  – Provision hardware with more spindles/TB
  – Configure L1 (compactions, region size, &c.) based on write pattern
  – Balance contention between maintaining L1 and serving reads
  – Additional resources:
    – "Configuring HBase Memstore: what you should know," http://blog.sematext.com/2012/07/16/hbase-memstore-what-you-should-know/
    – "Visualizing HBase Flushes And Compactions," http://www.ngdata.com/visualizing-hbase-flushes-and-compactions/
Page 36
37. Operational Tidbits
• Decommissioning a node outright will result in a downed server; use "graceful_stop.sh" to offload the workload from the region server first
• Use "zk_dump" to find all of your region servers and see how your ZooKeeper instances are faring
• Use "status 'summary'" or "status 'detailed'" for a count of live/dead servers, average load, and file counts
• Use "balancer" to automatically balance regions if HBase is set to auto-balance
• When using "hbase hbck" to diagnose and fix issues, RTFM!
Page 37
38. SQL and HBase
Hive and Phoenix over HBase
Page 38
39. Phoenix over HBase
• Phoenix is a SQL shim over HBase
• https://github.com/forcedotcom/phoenix
• HBase has fast write capabilities; Phoenix adds fast simple queries (no joins) and fast upserts on top
• Phoenix implements its own JDBC driver, so you can use your favorite tools
Page 39
40. Phoenix over HBase
Page 40
41. Hive over HBase
• Hive can be used directly with HBase
• Hive uses the MapReduce InputFormat "HBaseStorageHandler" to query from the table
• The Storage Handler has hooks for:
  – Getting input / output formats
  – Metadata operations: CREATE TABLE, DROP TABLE, etc.
• The Storage Handler is a table-level concept
  – Does not support Hive partitions and buckets
• Hive does not need to include all columns from the HBase table
Page 41
42. Hive over HBase
Page 42
43. Hive over HBase
Page 43
44. Hive and Phoenix over HBase
> hive
add jar /usr/lib/hbase/hbase-0.94.6.1.3.0.0-107-security.jar;
add jar /usr/lib/hbase/lib/zookeeper.jar;
add jar /usr/lib/hbase/lib/protobuf-java-2.4.0a.jar;
add jar /usr/lib/hive/lib/hive-hbase-handler-0.11.0.1.3.0.0-107.jar;
set hbase.zookeeper.quorum=node1.hadoop;

CREATE EXTERNAL TABLE phoenix_mobilelograw(
  key string,
  ip string,
  ts string,
  code string,
  d1 string,
  d2 string,
  d3 string,
  d4 string,
  properties string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
  "hbase.columns.mapping" = ":key,F:IP,F:TS,F:CODE,F:D1,F:D2,F:D3,F:D4,F:PROPERTIES")
TBLPROPERTIES ("hbase.table.name" = "MOBILELOGRAW");

set hive.hbase.wal.enabled=false;
INSERT OVERWRITE TABLE phoenix_mobilelograw SELECT * FROM hive_mobilelograw;
set hive.hbase.wal.enabled=true;
Page 44
45. HBase Roadmap
Page 45
46. Hortonworks Focus Areas for HBase
Page 46
Simplified Operations:
• Intelligent Compaction
• Automated Rebalancing
• Ambari Management:
  – Snapshot / Revert
  – Multimaster HA
  – Cross-site Replication
  – Backup / Restore
• Ambari Monitoring:
  – Latency metrics
  – Throughput metrics
  – Heatmaps
  – Region visualizations
Database Functionality:
• First-Class Datatypes
• SQL Interface Support
• Indexes
• Security:
  – Encryption
  – More Granular Permissions
• Performance:
  – Stripe Compactions
  – Short Circuit Read for Hadoop 2
  – Row and Entity Groups
• Deeper Hive/Pig Interop
47. HBase Roadmap Details: Operations
Page 47
• Snapshots:
  – Protect data or restore to a point in time.
• Intelligent Compaction:
  – Compact when the system is lightly utilized.
  – Avoid "compaction storms" that can break SLAs.
• Ambari Operational Improvements:
  – Configure multi-master HA.
  – Simple setup/configuration for replication.
  – Manage and schedule snapshots.
  – More visualizations, more health checks.
48. HBase Roadmap Details: Data Management
Page 48
• Datatypes:
  – First-class datatypes offer performance benefits and better interoperability with tools and other databases.
• SQL Interface (Preview):
  – SQL interface for simplified analysis of data within HBase.
  – JDBC driver allows embedding in existing applications.
• Security:
  – Granular permissions on data within HBase.
49. HOYA
HBase On YARN
Page 49
50. HOYA?
• The new YARN resource negotiation layer in Hadoop allows non-MapReduce applications to run on a Hadoop grid; why not let HBase take advantage of this capability?
• https://github.com/hortonworks/hoya/
• HOYA is a YARN application that provisions RegionServers based on an HBase cluster configuration
• HOYA helps bring HBase into YARN resource management and paves the way for advanced resource management with HBase
• HOYA can be used to spin up transient HBase clusters during MapReduce or other jobs
Page 50
51. A quick YARN refresher…
Page 51
52. The 1st Generation of Hadoop: Batch
HADOOP 1.0
Built for Web-Scale Batch Apps
(diagram: separate single-app silos for BATCH, INTERACTIVE, and ONLINE workloads, each on its own HDFS)
• All other usage patterns must leverage that same infrastructure
• Forces the creation of silos for managing mixed workloads
53. A Transition From Hadoop 1 to 2
HADOOP 1.0
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
54. A Transition From Hadoop 1 to 2
HADOOP 1.0
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
HADOOP 2.0
• HDFS (redundant, reliable storage)
• YARN (cluster resource management)
• MapReduce (data processing)
• Others (data processing)
55. The Enterprise Requirement: Beyond Batch
To become an enterprise-viable data platform, customers have told us they want to store ALL DATA in one place and interact with it in MULTIPLE WAYS
Simultaneously & with predictable levels of service
Page 55
(diagram: HDFS (Redundant, Reliable Storage) underpinning BATCH, INTERACTIVE, STREAMING, GRAPH, IN-MEMORY, HPC MPI, ONLINE, and OTHER workloads)
56. YARN: Taking Hadoop Beyond Batch
• Created to manage resource needs across all uses
• Ensures predictable performance & QoS for all apps
• Enables apps to run "IN" Hadoop rather than "ON" it
  – Key to leveraging all other common services of the Hadoop platform: security, data lifecycle management, etc.
Page 56
Applications Run Natively IN Hadoop
(diagram: HDFS2 (Redundant, Reliable Storage) under YARN (Cluster Resource Management), hosting BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, S4, …), GRAPH (Giraph), IN-MEMORY (Spark), HPC MPI (OpenMPI), ONLINE (HBase), and OTHER (Search, Weave, …))
57. HOYA Architecture
Page 57
58. Key HOYA Design Goals
1. Create on-demand HBase clusters
2. Maintain multiple HBase cluster configurations and implement them as required (i.e. high-load scenarios)
3. Isolation: sandbox clusters running different versions of HBase or with different coprocessors
4. Create transient HBase clusters for MapReduce or other processing
5. Elasticity of clusters for analytics, data ingest, and project-based work
6. Leverage the scheduling in YARN to ensure HBase can be a good Hadoop cluster tenant
Page 58
59. Time to call it an evening. We all have important work to do…
Page 59
60. Thank you…
Page 60
hbaseinaction.com
For more information, check out
HBase: The Definitive Guide
or
HBase in Action