Sahara is an OpenStack service that provides a platform for big data and data analytics workflows. It allows users to easily provision Hadoop clusters on OpenStack. Sahara supports various data processing engines and provides features such as cluster scaling, storage integration, and security. Benchmark testing showed significant performance overhead from using virtual machines compared to bare metal deployment, especially for I/O intensive workloads. Future work may include improved support for container-based and bare metal clusters as well as enhanced data processing and cluster management capabilities.
MySQL Applier for Apache Hadoop: Real-Time Event Streaming to HDFS (Mats Kindahl)
This presentation from MySQL Connect gives a brief introduction to Big Data and the tooling used to gain insights into your data. It also introduces an experimental prototype of the MySQL Applier for Hadoop, which can be used to incorporate changes from MySQL into HDFS using the replication protocol.
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... (Databricks)
This talk is about sharing experience and lessons learned on setting up and running the Apache Spark service inside the database group at CERN. It covers the many aspects of this change, with examples taken from use cases and projects at the CERN Hadoop, Spark, streaming, and database services. The talk is aimed at developers, DBAs, service managers, and members of the Spark community who are using and/or investigating "Big Data" solutions deployed alongside relational database processing systems. The talk highlights key aspects of Apache Spark that have fuelled its rapid adoption for CERN use cases and for the data processing community at large, including the fact that it provides easy-to-use APIs that unify, under one large umbrella, many different types of data processing workloads, from ETL to SQL reporting to ML.
Spark can also easily integrate a large variety of data sources, from file-based formats to relational databases and more. Notably, Spark can easily scale up data pipelines and workloads from laptops to large clusters of commodity hardware or in the cloud. The talk also addresses some key points about the adoption process and learning curve around Apache Spark and the related "Big Data" tools for a community of developers and DBAs at CERN with a background in relational database operations.
Introduction to Apache Spark Workshop at Lambda World 2015, held in Cádiz on October 23rd and 24th, 2015. Speakers: @fperezp and @juanpedromoreno
Github Repo: https://github.com/47deg/spark-workshop
Integrating Existing C++ Libraries into PySpark with Esther Kundin (Databricks)
Bloomberg’s Machine Learning/Text Analysis team has developed many machine learning libraries for fast real-time sentiment analysis of incoming news stories. These models were developed using smaller training sets, implemented in C++ for minimal latency, and are currently running in production. To facilitate backtesting our production models across our full data set, we needed to be able to parallelize our workloads, while using the actual production code.
We also wanted to integrate the C++ code with PySpark and use it to run our models. In this talk, I will discuss some of the challenges we faced, decisions we made, and other options when dealing with integrating existing C++ code into a Spark system. The techniques we developed have been used successfully by our team multiple times and I am sure others will benefit from the gotchas that we were able to identify.
Building Real-Time BI Systems with Kafka, Spark, and Kudu: Spark Summit East ... (Spark Summit)
One of the key challenges in working with real-time and streaming data is that the format used for capturing data is not necessarily the optimal format for ad hoc analytic queries. For example, Avro is a convenient and popular serialization system that is great for initially bringing data into HDFS. Avro has native integration with Flume and other tools, which makes it a good choice for landing data in Hadoop. But columnar file formats, such as Parquet and ORC, are much better optimized for ad hoc queries that aggregate over large numbers of similar rows.
Kerberizing Spark: Spark Summit East talk by Abel Rincon and Jorge Lopez-Malla (Spark Summit)
Spark has deservedly become the leading massively parallel processing framework, and HDFS is one of the most popular Big Data storage technologies; their combination is therefore one of the most common Big Data use cases. But what happens with security? Can these two technologies coexist in a secure environment? Furthermore, with the proliferation of BI technologies adapted to Big Data environments, which demand that several users interact with the same cluster concurrently, can we continue to ensure that our Big Data environments remain secure? In this lecture, Abel and Jorge will explain which adaptations of Spark's core they had to perform in order to guarantee the security of multiple concurrent users sharing a single Spark cluster, with any of its cluster managers, without degrading Spark's outstanding performance.
Unlocking Your Hadoop Data with Apache Spark and CDH5 (SAP Concur)
The Spark/Mesos Seattle Meetup group shares the latest presentation from their recent meetup event, showcasing real-world implementations of working with Spark within the context of your Big Data infrastructure.
Sessions are demo-heavy and slide-light, focusing on getting your development environment up and running, covering configuration issues, Spark SQL vs. Hive, and more.
To learn more about the Seattle meetup: http://www.meetup.com/Seattle-Spark-Meetup/members/21698691/
Support Iran's representative in the WSIS Prizes 2016 ... thank you ... (Leila Esmaeili)
The World Summit on the Information Society (WSIS) is held annually by the International Telecommunication Union (ITU), with the support of the United Nations, on the theme of information and communication technology and the information society. This global summit, attended by the telecommunications and ICT ministers of countries around the world, pursues eighteen main action lines, encouraging countries to follow them and to develop and advance their national ICT.
Among the most important of these action lines is developing ICT applications in e-learning and e-science. The integrated virtual education solution (ELIS) and the Kowsar-Net network have been developed along these two lines respectively, and are now competing against foreign rivals as Iran's representatives for the WSIS Prizes. Having passed the international evaluation and judging stage, these two projects are nominated for the best-project award in the e-learning and e-science categories and are now in the public voting stage.
This message is therefore sent to support our country's representatives in this international arena. Given the mission of these two projects and their scientific, cultural, and Islamic nature, you are invited to vote for and declare your support for the named projects (categories 9 and 14), and to invite other friends, colleagues, and specialists to support the integrated virtual education solution (ELIS) and the Kowsar-Net network. Your vote for projects in the other categories is also valuable.
Registration and voting site:
http://groups.itu.int/stocktaking/WSISPrizes/WSISPrizes2016/Voting.aspx
Voting deadline: 20 Esfand 1394
For more information, follow our Telegram channel: @wsisprizeswhc
Hadoop is often viewed as needing racks of dedicated boxes, despite the fact that in sheer number terms, the majority of Hadoop clusters ever created have been brought up on public cloud infrastructures, particularly Amazon's. Yet the rest of datacenter computing is moving towards virtualization, be it in-cloud startups or in-enterprise IT departments. Some organizations are standing up private clouds: a rack or two of servers with an API for VM creation. Hadoop can live there; it just needs to integrate better. At the same time, OpenStack is emerging as the de facto standard open source cloud platform for private use, and is available publicly from a number of cloud infrastructure service providers. This talk looks at what we've done, and are doing, to integrate Hadoop with OpenStack. This takes it beyond Hadoop's current support for Amazon's infrastructure, making a combined Hadoop + OpenStack cluster something to consider in-house and in-cloud. This work is being done in collaboration with members of the OpenStack community, showing how cloud and big data projects can not only co-exist but co-develop their platforms.
Slides from the hands-on workshop on data security and risk management in cloud computing - 20 Aban - Sharif University
Including plain-language examples of cloud computing and the results of a case study on risk analysis of cloud fax.
The Evolution of OpenStack – From Infancy to Enterprise (Rackspace)
As OpenStack turns 5 this year, we thought it would be a good time to take a look back at the evolution of OpenStack. We start with a quick overview of what OpenStack is, how OpenStack came to be, and describe the OpenStack Foundation. Next we describe the problem that OpenStack helps to solve, the components of OpenStack, and the timeline for when these components came to be. Last, we outline the current features and benefits that make OpenStack ready for the enterprise, with supporting enterprise use case examples. The blog post can be found at https://developer.rackspace.com/blog/evolution-of-openstack-from-infancy-to-enterprise/ and the webinar at https://www.brighttalk.com/webcast/11427/138613
The massive computing and storage resources that are needed to support big data applications make cloud environments an ideal fit. In this session, you'll learn how to build your big data "database on-demand" using MongoDB, Cassandra, Solr, MySQL, or any other big data solution, as well as manage your big data application using a new open source framework called “Cloudify.” All this, on top of the OpenStack cloud.
Our Multi-Year Journey to a 10x Faster Confluent Cloud (HostedbyConfluent)
Confluent Cloud is a cloud-native service based on Apache Kafka. We run tens of thousands of clusters across all major cloud service providers (AWS, GCP and Azure). In this talk, we will go over our journey to make Confluent Cloud 10x faster than Apache Kafka.
We will talk about how we designed our various workloads, the complexities involved in our cloud-native service, the challenges we faced, and the various pitfalls we ran into. We will also cover the interesting learnings, which in hindsight, are first principles from this multi-year journey.
By attending this talk, attendees will be able to take our learnings from making Confluent Cloud latencies 10x better and possibly apply similar principles to their cloud native data streaming systems.
sudoers: Benchmarking Hadoop with ALOJA (Nicolas Poggi)
Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open source benchmarking platform. The presentation was mostly a live demo; posting some slides for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
July 2018 talk to SW Data Meetup by Rob Vesse, Software Engineer, Cray Inc, discussing open source technologies for data science on high performance systems (Spark, Hadoop, PyData ecosystem, containers, etc), focusing on some of the implementation and scaling challenges they face.
Lookout on Scaling Security to 100 Million Devices (ScyllaDB)
The massive increase of security-related data requires companies to respond with new approaches to ingestion. Learn how Lookout has changed its approach for ingesting telemetry to meet their goal of growing from 1.5 million devices to 100 million devices and beyond, using Kafka Connect and switching from AWS DynamoDB to Scylla.
Improve performance and gain room to grow by easily migrating to a modern Ope... (Principled Technologies)
We deployed this modern environment, then migrated database VMs from legacy servers and saw performance improvements that support consolidation.
Conclusion
If your organization’s transactional databases are running on gear that is several years old, you have much to gain by upgrading to modern servers with new processors and networking components and an OpenShift environment. In our testing, a modern OpenShift environment with a cluster of three Dell PowerEdge R7615 servers with 4th Generation AMD EPYC processors and high-speed 100Gb Broadcom NICs outperformed a legacy environment with MySQL VMs running on a cluster of three Dell PowerEdge R7515 servers with 3rd Generation AMD EPYC processors and 25Gb Broadcom NICs. We also easily migrated a VM from the legacy environment to the modern environment, with only a few steps required to set up and less than ten minutes of hands-on time. The performance advantage of the modern servers would allow a company to reduce the number of servers necessary to perform a given amount of database work, thus lowering operational expenditures such as power and cooling and IT staff time for maintenance. The high-speed 100Gb Broadcom NICs in this solution also give companies better network performance and networking capacity to grow as they embrace emerging technologies such as AI that put great demands on networks.
DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence (inside-BigData.com)
In this deck, Johann Lombardi from Intel presents: DAOS - Scale-Out Software-Defined Storage for HPC/Big Data/AI Convergence.
"Intel has been building an entirely open source software ecosystem for data-centric computing, fully optimized for Intel® architecture and non-volatile memory (NVM) technologies, including Intel Optane DC persistent memory and Intel Optane DC SSDs. Distributed Asynchronous Object Storage (DAOS) is the foundation of the Intel exascale storage stack. DAOS is an open source software-defined scale-out object store that provides high bandwidth, low latency, and high I/O operations per second (IOPS) storage containers to HPC applications. It enables next-generation data-centric workflows that combine simulation, data analytics, and AI."
Unlike traditional storage stacks that were primarily designed for rotating media, DAOS is architected from the ground up to make use of new NVM technologies, and it is extremely lightweight because it operates end-to-end in user space with full operating system bypass. DAOS offers a shift away from an I/O model designed for block-based, high-latency storage to one that inherently supports fine-grained data access and unlocks the performance of next-generation storage technologies.
Watch the video: https://youtu.be/wnGBW31yhLM
Learn more: https://www.intel.com/content/www/us/en/high-performance-computing/daos-high-performance-storage-brief.html
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Dr. Ike Nassi, Founder, TidalScale at MLconf NYC - 4/15/16 (MLconf)
Scaling Spark – Vertically: The mantra of Spark technology is divide and conquer, especially for problems too big for a single computer. The more you divide a problem across worker nodes, the more total memory and processing parallelism you can exploit. This comes with a trade-off. Splitting applications and data across multiple nodes is nontrivial, and more distribution results in more network traffic which becomes a bottleneck. Can you achieve scale and parallelism without those costs?
We’ll show results of a variety of Spark application domains including structured data, graph processing and common machine learning in a single, high-capacity scaled-up system versus a more distributed approach and discuss how virtualization can be used to define node size flexibly, achieving the best balance for Spark performance.
Enterprise data centers house numerous workloads. With Hadoop growing in these data centers, IT departments need tools to avoid creating silos while maintaining SLAs, reporting, and charge-back requirements. We present a completely open source reference architecture including Apache Hadoop, Linux cgroups and namespace isolation, Gluster, and HTCondor. Topics to be covered:
- Augmenting existing HDFS and MapReduce infrastructure with dynamically provisioned resources
- On-demand creating, growing, and shrinking of MapReduce infrastructure for user workloads
- Isolating workloads to enable multi-tenant access to resources
- Publishing of resource utilization and accounting information for ingest into charge-back systems
Similar to Benchmarking sahara based big data as a service solutions (20)
Securing your Kubernetes cluster: a step-by-step guide to success! (KatiaHIMEUR1)
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
Generative AI Deep Dive: Advancing from Proof of Concept to Production (Aggregage)
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
Climate Impact of Software Testing at Nordic Testing Days (Kari Kakkonen)
My slides at Nordic Testing Days 6.6.2024
The talk discusses the climate impact and sustainability of software testing. ICT and testing must carry their part of the global responsibility to help with climate warming. We can minimize the carbon footprint, but we can also have a carbon handprint, a positive impact on the climate. Quality characteristics can be extended with sustainability, which can then be measured continuously. Test environments can be used less, at smaller scale, and on demand. Test techniques can be used to optimize or minimize the number of tests. Test automation can be used to speed up testing.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD within UiPath
End-to-end overview of the CI/CD pipeline with Azure DevOps
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
GraphRAG is All You Need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0! (SOFTTECHHUB)
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by Rik Marselis and me at the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor... (SOFTTECHHUB)
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
Communications Mining Series - Zero to Hero - Session 1 (DianaGray10)
This session provides an introduction to UiPath Communication Mining, its importance, and a platform overview. You will acquire a good understanding of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
How to Get CNIC Information System with Paksim Ga.pptx (danishmna97)
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
UiPath Test Automation using UiPath Test Suite series, part 6 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
The UiPath Test Automation with generative AI and OpenAI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with OpenAI's advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
2. Agenda
o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
3. Why Sahara: Cloud features
o You or someone at your company is using AWS, Azure, or Google
o You're probably doing it for easy access to OS instances, but also the modern application features, e.g. AWS' EMR or RDS or Storage
o Migrating to, or even using, OpenStack infrastructure for workloads means having application features, e.g. Sahara & Trove
o Writing applications is complex enough without having to manage supporting (non-value-add) infrastructure
4. Why Sahara: Data analysis
o Writing data analysis applications is especially hard
o Complexity in acquiring data
o Complexity in organizing (ETL) data
o Complexity in analyzing data
o Complexity in integrating analysis into applications (bonus!)
o This compounded complexity does not even include the necessary tooling and infrastructure
5. Agenda
o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
6. Sahara features
o Repeatable cluster provisioning and management operations
o Data processing workflows (EDP)
o Cluster scaling (elasticity), storage integration (Swift, Cinder, HCFS)
o Network and security group (firewall) integration
o Service anti-affinity (fault domains & efficiency)
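Sahara's "repeatable cluster provisioning" is template-driven: a cluster template names a plugin, a version, and a set of node groups, and the service instantiates clusters from it. The sketch below composes such a request body in plain Python; the field names are modeled on Sahara's REST API but should be treated as illustrative assumptions rather than a verified schema.

```python
# Illustrative sketch of the payload a Sahara cluster-template request
# carries. Field names (plugin_name, hadoop_version, node_groups, ...)
# are modeled on Sahara's REST API and are assumptions, not a verified schema.

def make_cluster_template(name, plugin, version, node_groups):
    """Compose a cluster-template dict and sanity-check the node groups."""
    total = sum(ng["count"] for ng in node_groups)
    if total < 1:
        raise ValueError("cluster needs at least one instance")
    return {
        "name": name,
        "plugin_name": plugin,       # e.g. "vanilla", "spark"
        "hadoop_version": version,
        "node_groups": node_groups,
        "instance_count": total,     # derived, for convenience
    }

template = make_cluster_template(
    "demo-cluster", "vanilla", "2.7.1",
    [
        {"name": "master", "flavor_id": "m1.large", "count": 1,
         "node_processes": ["namenode", "resourcemanager"]},
        {"name": "worker", "flavor_id": "m1.medium", "count": 3,
         "node_processes": ["datanode", "nodemanager"]},
    ],
)
print(template["instance_count"])  # 4
```

In a real deployment this dict would be submitted through python-saharaclient or the REST endpoint; the point here is only the shape of the template, which is what makes provisioning repeatable.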
8. Sahara plugins
o Users get a choice of integrated data processing engines
o Vendors get a way to integrate with OpenStack and access users
o Upstream - Apache Hadoop (Vanilla), Hortonworks, Cloudera, MapR, Apache Spark, Apache Storm
o Downstream - depends on your OpenStack vendor
9. o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
9
Agenda
10. 10
Storage Architecture
[Diagram: four storage layouts. #1: HDFS inside the computing VMs. #2: HDFS on the host. #3: HDFS in a separate VM. #4: remote storage such as legacy NFS, GlusterFS, Ceph, external HDFS, or Swift.]
Scenario #1: computing and data service collocate in the VMs
Scenario #2: data service is located in the host world
Scenario #3: data service is located in a separate VM world
Scenario #4: data service is located in the remote network
o Tenant provisioned (in VM)
o HDFS in the same VMs as the computing tasks vs. in
different VMs
o Ephemeral disk vs. Cinder volume
o Admin provided
o Logically disaggregated from computing tasks
o Physical collocation is a matter of deployment
o For network-remote storage, Neutron DVR is a very
useful feature
o A disaggregated (and centralized) storage system has
significant value
o No data silos, more business opportunities
o Could leverage the Manila service
o Allows creating advanced solutions (e.g. an in-memory
overlay)
o More vendor-specific optimization opportunities
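To a tenant, the backend choice above mostly surfaces as the URL scheme of the data source handed to a job. A small sketch, in which all host names, ports, paths, and container names are hypothetical:

```python
# Illustrative data-source URLs for the storage options on this slide.
# Every host, port, path, and container name here is a placeholder.
sources = {
    "internal_hdfs": "hdfs://cluster-master:8020/user/demo/input",  # scenario #1
    "external_hdfs": "hdfs://external-nn:8020/shared/input",        # scenario #4
    "swift":         "swift://demo-container/input",                # scenario #4
}

# The job logic is unchanged; only the URL (and credentials) differ.
schemes = sorted({url.split("://", 1)[0] for url in sources.values()})
print(schemes)  # ['hdfs', 'swift']
```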
11. o Containers seem promising but still need better support
o Determining the appropriate cluster size is always a challenge for tenants
o e.g. a small flavor with more nodes or a large flavor with fewer nodes
11
Compute Engine
o VM
Pros: best support in OpenStack; strong security
Cons: slow to provision; relatively high runtime performance overhead
o Container
Pros: light-weight, fast provisioning; better runtime performance than VM
Cons: Nova-docker readiness; Cinder volume support is not ready yet; weaker security than VM; not the ideal way to use containers
o Bare-Metal
Pros: best performance and QoS; best security isolation
Cons: Ironic readiness; worst efficiency (e.g. consolidation of workloads with different behaviors); worst flexibility (e.g. migration); worst elasticity due to slow provisioning
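The flavor-vs-node-count question above is partly plain capacity arithmetic: two shapes can offer the same aggregate resources while differing in parallelism and blast radius. A sketch with made-up flavor sizes:

```python
# Two hypothetical cluster shapes with equal aggregate capacity:
# many small nodes vs. few large nodes (all numbers are made up).
small = {"vcpus": 2, "ram_gb": 4,  "count": 8}   # 8 small nodes
large = {"vcpus": 8, "ram_gb": 16, "count": 2}   # 2 large nodes

def totals(shape):
    """Return (total vCPUs, total RAM in GB) for a cluster shape."""
    return shape["vcpus"] * shape["count"], shape["ram_gb"] * shape["count"]

print(totals(small))  # (16, 32)
print(totals(large))  # (16, 32)
# Same capacity; they differ in scheduling granularity, per-node failure
# impact, and how much HDFS replication traffic crosses the network.
```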
12. o Direct Cluster Operations
o Sahara is used as a provisioning engine
o Tenants expect to have direct access to the virtual cluster
o e.g. directly SSH into the VMs
o May use whatever APIs come with the distro
o e.g. Oozie
o EDP approach
o Sahara’s EDP is designed to be an abstraction layer for tenants to consume the services
o Ideally should be vendor neutral and plugin agnostic
o Limited job types are supported at present
o 3rd party abstraction APIs
o Not supported yet
o e.g. Cask CDAP
Data Processing API
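As a rough illustration of the EDP abstraction, a job-execution request body might look like the following sketch; the structure follows the general shape of Sahara's EDP job-execution API, but every ID is a placeholder and the config key is only an example:

```python
# Sketch of an EDP job-execution request body. All IDs are placeholders.
job_execution = {
    "cluster_id": "<cluster-uuid>",
    "input_id": "<data-source-uuid>",    # e.g. a swift:// or hdfs:// data source
    "output_id": "<data-source-uuid>",
    "job_configs": {
        "configs": {"mapred.reduce.tasks": "4"},  # example engine-specific tuning
        "args": [],
    },
}

# The point of EDP: the same payload shape is submitted whether the
# cluster runs vanilla Hadoop, CDH, HDP, or Spark, so tenants are not
# tied to one distro's native job API (such as Oozie).
print(sorted(job_execution))
```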
13. Deployment Considerations Matrix
o Storage: tenant- vs. admin-provisioned; disaggregated vs. collocated; HDFS vs. other options
o Compute: VM, Container, Bare-metal
o Distro/Plugin: Vanilla, CDH, HDP, MapR, Spark, Storm
o Data Processing API: traditional EDP (Sahara native) vs. 3rd party APIs
Performance results in the next section
14. o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
14
Agenda
16. Ephemeral Disk Performance
[Diagram: HDFS on the host with RAID vs. HDFS in VMs on the Nova instance store]
● 1.3x read overhead, 2.1x overhead
○ disk access pattern change: 10%
○ virtualization overhead: 90%
■ 60% due to I/O overhead
■ 30% due to memory inefficiency in virtualization
Heavy tuning is required.
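The breakdown above is simple arithmetic over the extra run time implied by the 2.1x result; a sketch:

```python
# Decompose the 2.1x ephemeral-disk result: the extra run time beyond
# the 1.0x baseline splits per the percentages quoted on the slide.
baseline = 1.0
measured = 2.1
extra = measured - baseline                 # ~1.1x extra run time

pattern_change = 0.10 * extra               # disk access pattern change
virtualization = 0.90 * extra               # virtualization overhead, total
io_overhead    = 0.60 * extra               # ... of which I/O
memory_ineff   = 0.30 * extra               # ... of which memory inefficiency

# The two virtualization components account for the full 90% share.
assert abs((io_overhead + memory_ineff) - virtualization) < 1e-9
```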
17. o The performance difference comes from several factors:
o Free of I/O virtualization overhead
o DVR brings a huge performance enhancement
o Location awareness (HVE) is not enabled
External HDFS Performance
[Diagram: external HDFS on hosts with RAID vs. HDFS in VMs on the Nova instance store; 2x improvement, 1.3x overhead]
18. Swift Performance
o Similar to external HDFS but even worse
o Without location awareness enabled, all the traffic goes through the Swift
proxy node
[Diagram: Swift object storage accessed from computing VMs vs. HDFS in VMs on the Nova instance store; 1.35x overhead]
19. ● For an I/O intensive workload, the 2.19x overhead is big but consistent with previous
results.
● Containers demonstrate fair performance compared to KVM
○ considering that OpenStack services also consume resources, 0.46x is not that bad.
Bare-metal vs. Container vs. VM
[Diagram: HDFS on a bare-metal host (1x) vs. in containers (1.46x) vs. in VMs (2.19x)]
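Normalizing the three measurements against bare metal makes the comparison explicit; minimal arithmetic sketch:

```python
# Run time relative to bare metal (1x), from the slide's measurements.
bare_metal, container, vm = 1.00, 1.46, 2.19

container_extra = container - bare_metal    # extra cost of containers
vm_extra        = vm - bare_metal           # extra cost of KVM VMs

# Containers sit much closer to bare metal than VMs do for this workload,
# which is the 0.46x figure quoted above.
assert container_extra < vm_extra
print(round(container_extra, 2))  # 0.46
```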
20. o Why Sahara
o Sahara introduction
o Deployment considerations
o Performance testing and results
o Future envisioning
o Summary and Call to Action
20
Agenda
21. o Architected for disaggregated computing and storage
o Supporting more storage backends
o Integration with Manila
o Better support for container and bare-metal (Nova-docker and Ironic)
o Magnum as an alternative?
o EDP as a PaaS like layer for Sahara core provisioning engine
o Data connector abstraction
o Workflow management
o Policy engine for resource and SLA management
o Auto-scale, auto-tune
o Sahara needs to offer broader vendor integration opportunities (not just engines)
o A complete big data stack may have many options at each layer
o e.g. acceleration libraries, analytics, developer oriented application framework (e.g.
CDAP)
o Requires a more generic plugin/driver framework to support it
21
Future of Sahara? (NOT a roadmap)
22. o Upgrade HDP plugin to HDP 2.2
o Hadoop HA for CDH and HDP
o Enhancements to provisioning through Heat
o Bring Sahara to python-openstackclient
o Bare-metal clusters with Ironic
o Security enhancements
o EDP enhancements
o Job scheduling
o Coordination
o Log retrieval
o Improved parameter specification
22
Liberty Roadmap Highlights
23. o Great improvements in the Sahara Kilo release. Production ready with
real customer deployments.
o A complete Big-Data-as-a-Service solution requires more
considerations than simply adding a Sahara service to the
existing OpenStack deployment
o Preliminary benchmark results show the performance gap with
bare-metal is still huge. Tuning and optimizations are required.
o Many features could be added to enhance Sahara. Opportunities
exist for various types of vendors.
23
Summary and Call-to-Action
Join the Sahara community and make it even more vibrant!