HDFS scalability and availability are limited by its single-namespace-server design. Giraffa is an experimental file system that uses HBase to maintain the file system namespace in a distributed way and serves data directly from HDFS DataNodes. Giraffa is intended to provide higher scalability and availability and to maintain very large namespaces. The presentation will explain the Giraffa architecture and its motivation, address its main challenges, and give an update on the status of the project.
Presenter: Konstantin Shvachko (PhD), Founder, AltoScale
Sep 2012 HUG: Giraffa File System to Grow Hadoop Bigger
1. The Giraffa File System
Konstantin V. Shvachko
Alto Storage Technologies
September 19, 2012 Hadoop User Group
AltoStor
2. Giraffa
Giraffa is a distributed, highly available file system
Utilizes features of HDFS and HBase
New open source project in experimental stage
3. Apache Hadoop
A reliable, scalable, high performance distributed storage and computing system
The Hadoop Distributed File System (HDFS)
Reliable storage layer
MapReduce – distributed computation framework
Simple computational model
Ecosystem of Big Data tools
HBase, ZooKeeper
4. The Design Principles
Linear scalability
More nodes can do more work within the same time
On Data size and Compute resources
Reliability and Availability
A drive fails about once in 3 years, so the probability of it failing on a given day is about 1/1000
Several drives fail every day on a cluster with thousands of drives
Move computation to data
Minimize expensive data transfers
Sequential data processing
Avoid random reads. [Use HBase for random data access]
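The failure figures above can be checked with a little arithmetic. The sketch below assumes a uniform failure rate, and the 5000-drive cluster size is an illustrative value, not a number from the talk.

```python
# Back-of-the-envelope check of the drive failure figures quoted above.
DAYS_PER_YEAR = 365
YEARS_PER_FAILURE = 3  # from the slide: one drive failure per 3 years


def daily_failure_probability(years_per_failure: float = YEARS_PER_FAILURE) -> float:
    """Chance that a given drive fails on a given day, assuming a uniform rate."""
    return 1.0 / (years_per_failure * DAYS_PER_YEAR)


def expected_failures_per_day(num_drives: int) -> float:
    """Expected number of drive failures per day across the whole cluster."""
    return num_drives * daily_failure_probability()


if __name__ == "__main__":
    print(f"per-drive daily probability: 1/{1 / daily_failure_probability():.0f}")
    # 5000 drives is a hypothetical cluster size chosen for illustration.
    print(f"expected failures/day with 5000 drives: {expected_failures_per_day(5000):.1f}")
```

Three years is 1095 days, which is where the rounded 1/1000 comes from. At that rate a cluster with a few thousand drives should expect several failures per day.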
5. Hadoop Cluster
HDFS – a distributed file system
NameNode – namespace and block management
DataNodes – block replica container
MapReduce – a framework for distributed computations
JobTracker – job scheduling, resource management, lifecycle coordination
TaskTracker – task execution module
[Diagram: NameNode and JobTracker as masters above three worker nodes, each running a TaskTracker and a DataNode]
6. Hadoop Distributed File System
The namespace is a hierarchy of files and directories
Files are divided into large blocks (128 MB)
Namespace (metadata) is decoupled from data
Fast namespace operations, not slowed down by data streaming
Direct data streaming from the source storage
Single NameNode keeps entire namespace in RAM
DataNodes store block replicas as files on local drives
Blocks replicated on 3 DataNodes for redundancy & availability
HDFS client – point of entry to HDFS
Contacts NameNode for metadata
Serves data to applications directly from DataNodes
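To make the block and replica numbers concrete, here is a minimal sketch of how a file of a given size maps onto 128 MB blocks with 3 replicas each. The function name is invented for illustration and this is not HDFS code.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size, from the slide
REPLICATION = 3                 # replicas per block, from the slide


def block_layout(file_size: int) -> tuple[int, int, int]:
    """Return (number of blocks, size of last block, total replicas) for a file."""
    if file_size == 0:
        return 0, 0, 0
    num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division
    last_block = file_size - (num_blocks - 1) * BLOCK_SIZE
    return num_blocks, last_block, num_blocks * REPLICATION


if __name__ == "__main__":
    # A 300 MB file: two full blocks plus a partial one, each stored three times.
    print(block_layout(300 * 1024 * 1024))
```

Only the last block of a file can be smaller than the block size, which is why small files yield few blocks per file.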
7. Scalability Limits
Single-master architecture: a constraining resource
Single NameNode limits linear performance growth
A handful of “bad” clients can saturate NameNode
Single point of failure: takes whole cluster out of service
NameNode space limit
100 million files and 200 million blocks with 64GB RAM
Restricts storage capacity to 20 PB
Small file problem: block-to-file ratio is shrinking
“HDFS Scalability: The limits to growth” USENIX ;login: 2010
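The NameNode space limit can be turned into a rough per-object figure. This sketch only rearranges the numbers on the slide. Reading 64GB as decimal gigabytes and assuming every block is a full 128 MB are assumptions made here, not statements from the talk.

```python
FILES = 100_000_000        # from the slide
BLOCKS = 200_000_000       # from the slide
RAM_BYTES = 64 * 10**9     # 64 GB, read here as decimal gigabytes (assumption)
BLOCK_SIZE = 128 * 2**20   # 128 MB blocks, as on the HDFS slide

# RAM available per namespace object (file or block) under these numbers.
bytes_per_object = RAM_BYTES / (FILES + BLOCKS)
blocks_per_file = BLOCKS / FILES
# Upper bound on stored data: every block completely full, before replication.
max_data_bytes = BLOCKS * BLOCK_SIZE

if __name__ == "__main__":
    print(f"RAM per namespace object: {bytes_per_object:.0f} bytes")
    print(f"blocks per file: {blocks_per_file:.1f}")
    print(f"upper bound on stored data: {max_data_bytes / 10**15:.1f} PB")
```

The bound comes out somewhat above the 20 PB on the slide, which is what one would expect if not every block is full.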
8. Node Count Visualization
[Chart: resources per node (cores, disks, RAM) versus cluster size (number of nodes)]
2008 Yahoo!: 4000-node cluster
2010 Facebook: 2000 nodes
2011 eBay: 1000 nodes
2013: cluster of 500 nodes
9. Horizontal to Vertical Scaling
Horizontal scaling is limited by single-master architecture
Natural growth of compute power and storage density
Clusters composed of more dense & powerful servers
Vertical scaling leads to cluster size shrinking
Storage capacity, Compute power, and Cost remain constant
Exponential Information Growth
2006 Chevron accumulates 2 TB a day
2012 Facebook ingests 500 TB a day
10. Scalability for Hadoop 2.0
HDFS Federation
Independent NameNodes sharing a common pool of DataNodes
Cluster is a family of volumes with shared block storage layer
User sees volumes as isolated file systems
ViewFS: the client-side mount table
YARN: New MapReduce framework
Dynamic partitioning of cluster resources: no fixed slots
Separation of JobTracker functions
1. Job scheduling and resource allocation: centralized
2. Job monitoring and job life-cycle coordination: decentralized
o Delegate coordination of different jobs to other nodes
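The idea of a client-side mount table can be sketched in a few lines. This is a toy resolver, not the ViewFS implementation or its configuration format. The mount points and host names are invented for illustration.

```python
# Hypothetical mount table: mount point -> volume URI (host names are invented).
MOUNT_TABLE = {
    "/user": "hdfs://nn1.example.com/user",
    "/data": "hdfs://nn2.example.com/data",
    "/tmp": "hdfs://nn3.example.com/tmp",
}


def resolve(path: str, mount_table: dict[str, str] = MOUNT_TABLE) -> str:
    """Map a client-visible path to a path on the volume that owns it."""
    for mount_point, target in mount_table.items():
        # Match whole path components only, so "/username" is not under "/user".
        if path == mount_point or path.startswith(mount_point + "/"):
            return target + path[len(mount_point):]
    raise KeyError(f"no mount point covers {path}")


if __name__ == "__main__":
    print(resolve("/user/alice/job.xml"))
```

Because the table lives on the client, each volume stays an independent namespace. The static assignment of sub-trees to volumes is visible directly in the table.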
11. Namespace Partitioning
Static: Federation
Directory sub-trees are statically assigned to disjoint volumes
Relocating sub-trees without copying is challenging
Scale x10: billions of files
Dynamic:
Files, directory sub-trees can move automatically between nodes based on their utilization or load balancing requirements
Files can be relocated without copying data blocks
Scale x100: 100s of billions of files
Orthogonal independent approaches.
Federation of distributed namespaces is possible
12. Giraffa File System
HDFS + HBase = Giraffa
Goal: build from existing building blocks
Minimize changes to existing components
1. Store file & directory metadata in HBase table
Dynamic table partitioning into regions
Cached in RegionServer RAM for fast access
2. Store file data in HDFS DataNodes: data streaming
3. Block management
Handle communication with DataNodes: heartbeat, blockReport, addBlock
Perform block allocation, replication, and deletion
13. Giraffa Requirements
Availability – the primary goal
Load balancing of metadata traffic
Same data streaming speed to / from DataNodes
Continuous Availability: No SPOF
Cluster operability, management
Cost of running larger clusters same as a smaller one
More files & more data
                     HDFS          Federated HDFS   Giraffa
Space                25 PB         120 PB           1 EB = 1000 PB
Files + blocks       200 million   1 billion        100 billion
Concurrent clients   40,000        100,000          1 million
14. HBase Overview
Table: big, sparse, loosely structured
Collection of rows, sorted by row keys
Rows can have arbitrary number of columns
Dynamic Table partitioning!
Table is split Horizontally into Regions
Region Servers serve regions to applications
Columns grouped into Column families: vertical partition of tables
Distributed Cache:
Regions are loaded in nodes’ RAM
Real-time access to data
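The sorted-rows-split-into-regions model can be illustrated with a small in-memory table. This is a toy model in Python, not the HBase API. The class and method names are made up, and a real table splits regions dynamically instead of taking fixed split keys.

```python
import bisect


class ToyTable:
    """Rows kept sorted by row key and split horizontally into regions."""

    def __init__(self, split_keys):
        # Region i holds row keys in [split_keys[i-1], split_keys[i]).
        self.split_keys = sorted(split_keys)
        self.rows = {}

    def region_of(self, row_key):
        """Index of the region that serves this row key."""
        return bisect.bisect_right(self.split_keys, row_key)

    def put(self, row_key, columns):
        """Rows are sparse: each one carries whatever columns it was given."""
        self.rows.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        return self.rows.get(row_key)

    def scan(self, start, stop):
        """Yield rows with start <= key < stop in row-key order."""
        for key in sorted(self.rows):
            if start <= key < stop:
                yield key, self.rows[key]
```

Since rows are ordered by key, a scan over a key range touches only the regions that cover that range. That property is what the row key design later in the talk relies on.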
16. HBase API
HBaseAdmin: administrative functions
Create, delete, list tables
Create, update, delete columns, column families
Split, compact, flush
HTable: access table data
Result HTable.get(Get g) // get cells of a row
void HTable.put(Put p) // update a row
void HTable.delete(Delete d) // delete cells/row
ResultScanner HTable.getScanner(byte[] family) // scan a column family
Variety of Filters
Coprocessors:
Custom actions triggered by update events
Like database triggers or stored procedures
17. Building Blocks
Giraffa clients
Fetch file & block metadata from Namespace Service
Exchange data with DataNodes
Namespace Service
HBase Table stores File metadata as rows
Block Management
Distributed collection of Giraffa block metadata
Data Management
DataNodes. Distributed collection of data blocks
18. Giraffa Architecture
[Diagram: an application (App) with a NamespaceAgent talks to the Namespace Service in HBase; the Namespace Table holds path, attrs, block[], DN[][] and has a Block Management Processor; below it, the Block Management Layer (BM servers) sits above the DataNodes (DN)]
1. Giraffa client gets files and blocks from HBase
2. Block Manager handles block operations
3. Stream data to or from DataNodes
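The three steps of the architecture can be walked through with plain dictionaries standing in for the services. This is a toy read path, not Giraffa code. All paths, block IDs, and node names are invented.

```python
# Namespace Service stand-in: path -> file metadata row (step 1).
namespace_table = {
    "/logs/a.log": {"length": 11, "blocks": ["blk_1", "blk_2"]},
}

# Block Management stand-in: block ID -> DataNodes holding a replica (step 2).
block_manager = {
    "blk_1": ["dn1", "dn2", "dn3"],
    "blk_2": ["dn2", "dn3", "dn1"],
}

# DataNode stand-ins: node -> blocks stored on it (step 3).
datanodes = {
    "dn1": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn2": {"blk_1": b"hello ", "blk_2": b"world"},
    "dn3": {"blk_1": b"hello ", "blk_2": b"world"},
}


def read_file(path: str) -> bytes:
    """Read a file by following the three steps from the architecture slide."""
    row = namespace_table[path]                    # 1. metadata from the namespace
    data = bytearray()
    for block_id in row["blocks"]:
        locations = block_manager[block_id]        # 2. block locations
        data += datanodes[locations[0]][block_id]  # 3. stream from a DataNode
    return bytes(data)
```

Only the small metadata lookups go through the namespace and block management layers. The file bytes come straight from a DataNode, as in HDFS.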
20. Namespace Table
A single table called “Namespace” stores:
Row Key = File ID
File attributes:
o Local name, owner, group, permissions, access-time, modification-time, block-size, replication, isDir, length
List of blocks of a file
o Persisted in the table
List of block locations for each block
o Not persisted, but discovered from the BlockManager
Directory table
o maps directory entry name to respective child row key
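One way to picture a row of this table is as a record whose fields mirror the attribute list above. The class below is an illustrative sketch, not Giraffa's schema. The Python field names and types are assumptions, and the directory table is left out.

```python
from dataclasses import dataclass, field


@dataclass
class NamespaceRow:
    """One row of the Namespace table; fields follow the slide's attribute list."""
    row_key: bytes                      # file ID
    local_name: str
    owner: str
    group: str
    permissions: int
    access_time: int
    modification_time: int
    block_size: int
    replication: int
    is_dir: bool
    length: int
    blocks: list[str] = field(default_factory=list)                      # persisted
    block_locations: dict[str, list[str]] = field(default_factory=dict)  # not persisted

    def persisted_columns(self) -> dict:
        """Columns written to the table: everything except block locations."""
        columns = dict(vars(self))
        del columns["row_key"]          # the key addresses the row, it is not a column
        del columns["block_locations"]  # discovered from the BlockManager at run time
        return columns
```

Keeping locations out of the persisted row means the table never has to be rewritten when replicas move between DataNodes.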
21. Namespace Service
[Diagram: the HBase Namespace Service consists of three Region Servers; each serves several Regions and runs an NS Processor (1) and a BM Processor (2); the BM Processors connect to the Block Management Layer below]
22. Block Manager
Maintains flat namespace of Giraffa block metadata
1. Block management
Block allocation, deletion, replication
2. DataNode management
Process DataNode block reports, heartbeats. Identify lost nodes
3. Storage for the HBase table
Small file system to store HFiles, HLog
BM Server paired on the same node with RegionServer
Distributed cluster of BMServers
Mostly local communication between Region and BM Servers
NameNode as an initial implementation of BMServer
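The DataNode management duties can be sketched as a small bookkeeping class. This is a toy model, not the Giraffa BMServer or the NameNode. The class name, the 30-second timeout, and the method names are assumptions chosen for illustration.

```python
class ToyBlockManager:
    """Tracks heartbeats and block reports, and flags lost DataNodes."""

    def __init__(self, heartbeat_timeout: float = 30.0):  # timeout value is hypothetical
        self.heartbeat_timeout = heartbeat_timeout
        self.last_heartbeat = {}    # DataNode ID -> time of last heartbeat
        self.block_locations = {}   # block ID -> set of DataNode IDs

    def heartbeat(self, datanode: str, now: float) -> None:
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode: str, block_ids, now: float) -> None:
        """Replace what we know about this node's blocks with its report."""
        self.heartbeat(datanode, now)
        for locations in self.block_locations.values():
            locations.discard(datanode)
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode)

    def lost_nodes(self, now: float) -> list:
        """DataNodes whose last heartbeat is older than the timeout."""
        return sorted(dn for dn, seen in self.last_heartbeat.items()
                      if now - seen > self.heartbeat_timeout)

    def under_replicated(self, now: float, replication: int = 3) -> list:
        """Blocks with fewer live replicas than the target replication."""
        lost = set(self.lost_nodes(now))
        return sorted(block_id for block_id, locations in self.block_locations.items()
                      if len(locations - lost) < replication)
```

A block that shows up as under-replicated is a candidate for re-replication, which is the replication duty listed under block management.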
23. Data Management
DataNodes store and report data blocks
Blocks are files on local drives
Data transfer to and from clients
Internal data transfers
Same as HDFS
24. Row Key Design
Row keys
Identify files and directories as rows in the table
Define sorting of rows in Namespace table
And therefore Namespace partitioning
Different row key definitions based on locality requirement
Key definition is chosen during file system formatting
Full-path-key is the default implementation
Problem: Rename can move object to another region
Row keys based on INode numbers
25. Locality of Reference
Files in the same directory – adjacent in the table
Belong to the same region (most of the time)
Efficient “ls”. Avoid jumping across regions
Row keys define sorting of files and directories in the table
Tree structured namespace is flattened into linear array
Ordered list of files is self-partitioned into regions
How to retain tree locality in linearized structure?
26. Partitioning: Random
Straightforward partitioning based on random hashing
[Diagram: namespace tree with nodes 1, 2, 3, 4, 15, 16 and subtrees T1, T2, T3, T4; labels id1, id2, id3]
27. Partitioning: Full Subtrees
Partitioning based on lexicographic full-path ordering
The default for Giraffa
[Diagram: namespace tree with nodes 1, 2, 3, 4, 15, 16 and subtrees T1, T2, T3, T4, laid out in full-path order]
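The difference between the two partitioning schemes shows up in where the contents of one directory land in row-key order. The sketch below uses a made-up namespace and plain path strings as full-path keys. Giraffa's actual key encoding may differ, and MD5 here is only a stand-in for "random hashing".

```python
import bisect
import hashlib

PATHS = ["/a", "/a-b", "/a/d", "/a/d/x", "/a/f1", "/b"]  # made-up namespace


def full_path_key(path: str) -> str:
    """Row key is the full path, so rows sort lexicographically by path."""
    return path


def hashed_key(path: str) -> str:
    """Row key is a hash of the path, so neighbours in the tree are scattered."""
    return hashlib.md5(path.encode()).hexdigest()


def positions_under(directory: str, key_fn) -> list[int]:
    """Positions, in row-key order, of every path below the directory."""
    ordered = sorted(PATHS, key=key_fn)
    return [i for i, path in enumerate(ordered) if path.startswith(directory + "/")]


def ls(directory: str) -> list[str]:
    """List direct children with a single range scan over full-path keys."""
    keys = sorted(PATHS)
    prefix = directory + "/"
    children = []
    for key in keys[bisect.bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break  # left the contiguous run of keys under this directory
        if "/" not in key[len(prefix):]:
            children.append(key)
    return children
```

With full-path keys, everything under a directory occupies one contiguous run of rows, so `ls` is a short range scan that usually stays inside one region. With hashed keys, the same entries can land anywhere in the table.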
29. Atomic Rename
Giraffa will implement atomic in-place rename
No support for atomic file move from one directory to another
Requires inode numbers as unique file IDs
A move can then be implemented on application level
Non-atomically move the file from the source directory to a temporary file in the target directory
Atomically rename the temporary file to its original name
On failure use simple 3-step recovery procedure
Eventually implement atomic moves
PAXOS
Simplified synchronization algorithms (ZAB)
30. 3-Step Recovery Procedure
Move of a file from srcDir to trgDir failed
1. If only the source file exists, then start the move over
2. If only the target temporary file exists, then complete the move by renaming the temporary file to the original name
3. If both the source and the temporary target file exist, then remove the source and rename the temporary file
This step is non-atomic and may fail as well.
In case of failure repeat the recovery procedure
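The application-level move and its recovery cases can be sketched against a dictionary that stands in for the namespace. This is an illustration of the procedure as stated on the slides, not Giraffa code. The class, the function names, and the temporary-file naming scheme are all hypothetical.

```python
class ToyNamespace:
    """In-memory stand-in for a file system namespace: path -> contents."""

    def __init__(self):
        self.files = {}

    def exists(self, path: str) -> bool:
        return path in self.files

    def copy_entry(self, src: str, dst: str) -> None:
        self.files[dst] = self.files[src]

    def delete(self, path: str) -> None:
        self.files.pop(path, None)

    def rename_in_place(self, old: str, new: str) -> None:
        """Stands for the atomic rename within one directory."""
        self.files[new] = self.files.pop(old)


def tmp_path(trg_dir: str, name: str) -> str:
    return f"{trg_dir}/.{name}.tmp"  # hypothetical temporary-file naming


def move(ns: ToyNamespace, src_dir: str, trg_dir: str, name: str) -> None:
    src, tmp, dst = f"{src_dir}/{name}", tmp_path(trg_dir, name), f"{trg_dir}/{name}"
    ns.copy_entry(src, tmp)       # non-atomic part: a crash can happen
    ns.delete(src)                # between or after these two calls
    ns.rename_in_place(tmp, dst)  # atomic in-place rename to the original name


def recover(ns: ToyNamespace, src_dir: str, trg_dir: str, name: str) -> None:
    src, tmp, dst = f"{src_dir}/{name}", tmp_path(trg_dir, name), f"{trg_dir}/{name}"
    has_src, has_tmp = ns.exists(src), ns.exists(tmp)
    if has_src and not has_tmp:
        move(ns, src_dir, trg_dir, name)  # 1. only source: start the move over
    elif has_tmp and not has_src:
        ns.rename_in_place(tmp, dst)      # 2. only temporary file: complete the move
    elif has_src and has_tmp:
        ns.delete(src)                    # 3. both exist: remove the source,
        ns.rename_in_place(tmp, dst)      #    then rename the temporary file
```

Calling `recover` again after another failure is safe in this sketch, since each case only looks at which of the two files currently exists. That matches the instruction to repeat the procedure on failure.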
31. New Giraffa Functionality
Custom file attributes: user defined file metadata
Hidden in complex file names or nested directories
o /logs/2012/08/31/server-ip.log
Stored in ZooKeeper or even stand-alone DBs
o Involves Synchronization
Advanced Scanning, Grouping, Filtering
Amazon S3 API turns Giraffa into reliable storage on the cloud
Versioning
Based on HBase row versioning
Restore objects deleted inadvertently
Alternative approach for snapshots
32. Status
We are on Apache Extras
One node cluster running
Row Key abstraction
HBase implementation in separate package
Other DBs or Key-Value stores can be plugged in
Infrastructure: Eclipse, FindBugs, JavaDoc, Ivy, Jenkins, Wiki
Server-side processing of FS requests: HBase endpoints
Testing Giraffa with TestHDFSCLI
Web UI. Multi-node cluster. Release…
34. Related Work
Ceph
Metadata stored on OSD
MDS caches metadata: Dynamic Partitioning
Lustre
Plans to release (2.4) distributed namespace
Code ready
Colossus: from Google (S. Quinlan and J. Dean)
100 million files per metadata server
Hundreds of servers
VoldFS, CassandraFS, KTHFS (MySQL)
Prototypes
MapR distributed file system
35. History
(2008) Idea. Study of distributed systems
AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …
Partitioning of the namespace: 4 types of partitioning
(2009) Study on scalability limits
NameNode optimization
(2010) Design with Michael Stack
Presentation at HDFS contributors meeting
(2011) Plamen implements POC
(2012) Rewrite open sourced as Apache Extras project
http://code.google.com/a/apache-extras.org/p/giraffa/
36. Etymology
Giraffe. Latin: Giraffa camelopardalis
Family Giraffidae
Genus Giraffa
Species Giraffa camelopardalis
Other languages
Arabic Zarafa
Spanish Jirafa
Bulgarian жирафа
Italian Giraffa
Favorites of my daughter
o As the Hadoop traditions require