The document provides an overview of Hadoop, a framework for distributed storage and processing of large datasets across clusters of computers. It discusses how Hadoop addresses the limitations of scaling up a single system by allowing horizontal scaling across multiple servers. The core Hadoop components, HDFS and MapReduce/YARN, are explained. HDFS provides a distributed file system that replicates data across servers for fault tolerance. MapReduce/YARN allows distributed processing of large datasets in parallel. The document also outlines some of the popular tools in the Hadoop ecosystem, such as Pig, Hive, HBase, and Spark.
Improving HDFS Availability with Hadoop RPC Quality of Service (Ming Ma)
Heavy users monopolizing cluster resources is a frequent cause of slowdown for others. With only one namenode and thousands of datanodes, any poorly written application is a potential distributed denial-of-service attack on the namenode. In this talk, you will learn how to prevent slowdown from heavy users and poorly written applications by enabling IPC Quality of Service (QoS), a new feature in Hadoop 2.6+. On Twitter’s and eBay’s production clusters, we’ve seen response times of 500 milliseconds with QoS off drop to 10 milliseconds with QoS on during heavy usage. We’ll cover how IPC QoS works and share our experience on how to tune performance.
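The idea behind this kind of RPC QoS can be illustrated with a toy model. The sketch below is not Hadoop's implementation: the class name, the thresholds, and the weights are all hypothetical, and it omits the periodic decay of call counts that a real scheduler would need. It only shows the general pattern of demoting heavy callers to lower-priority queues and serving the queues with weighted round-robin.

```python
from collections import Counter, deque


class FairCallQueueSketch:
    """Toy model: callers with a large share of recent calls are demoted."""

    def __init__(self, thresholds=(0.25, 0.5), weights=(4, 2, 1)):
        # thresholds: share of all calls above which a user drops one level
        # weights: how many calls each level may serve per round
        # (both are illustrative values, not Hadoop defaults)
        self.thresholds = thresholds
        self.weights = weights
        self.queues = [deque() for _ in weights]
        self.counts = Counter()

    def priority(self, user):
        """Return 0 (highest) .. len(thresholds) (lowest) for this user."""
        total = sum(self.counts.values())
        share = self.counts[user] / total if total else 0.0
        return sum(share > t for t in self.thresholds)

    def put(self, user, call):
        """Record the call and enqueue it at the user's current priority."""
        self.counts[user] += 1
        self.queues[self.priority(user)].append((user, call))

    def drain(self):
        """Yield queued calls using weighted round-robin across levels."""
        while any(self.queues):
            for queue, weight in zip(self.queues, self.weights):
                for _ in range(weight):
                    if not queue:
                        break
                    yield queue.popleft()
```

In this model, a light user's call can be served ahead of a backlog from a heavy user even if it arrived later, which is the effect the abstract describes.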
Amazon RDS for PostgreSQL - Postgres Open 2016 - New Features and Lessons Lea... (Grant McAlister)
Presentation from Postgres Open 2016 in Dallas (Sept 2016) - Covers new RDS features introduced over the last year and lessons learned operating a large fleet of PostgreSQL.
Are you taking advantage of all of Hadoop’s features to operate a stable and effective cluster? Inspired by real-world support cases, this talk discusses best practices and new features to help improve incident response and daily operations. Chances are that you’ll walk away from this talk with some new ideas to implement in your own clusters.
Automation of Hadoop cluster operations in Arm Treasure Data (Yan Wang)
This talk will focus on the journey that we on the Arm Treasure Data Hadoop team are on to simplify and automate how we deploy Hadoop. At Arm Treasure Data, until recently we were running Hadoop clusters in two clouds. As deployments expanded rapidly to more sites, the overhead of manual operations started to strain us. Because of this, we started a project last year to automate and simplify how we deploy, using tools like AWS Auto Scaling groups. The steps we have taken so far are modernizing and standardizing instance types, moving from manually executed deployment scripts to API-triggered workflows, and actively working to deprecate Chef in favor of Debian packages and AWS CodeDeploy. We have also started to automate many operations that until recently were manual, like scaling clusters in and out and routing traffic between clusters. We have also started to simplify health checks and node snapshotting. Our goal for the year is to get close to fully automated cluster operations.
Are you using the fastest query tool for Hadoop? This talk presents and discusses the latest performance results of the industry-standard TPC-H benchmarks executed across an assortment of open source query tools such as Hive (using MR, Tez, LLAP, Spark), SparkSQL, Presto, and Drill. Additionally, the performance tests will utilize a variety of data sizes, popular storage formats such as ORC, Parquet, and Text, and compression codecs.
Amazon RDS for PostgreSQL: What's New and Lessons Learned - NY 2017 (Grant McAlister)
We will begin with a quick overview of the Amazon RDS service and how it achieves durability and high availability. Then we will do a deep dive into the exciting new features we recently released, including 9.6, snapshot sharing, enhancements to encryption, vacuum, and replication. We will also explore lessons we have learned managing a large fleet of PostgreSQL instances, including important tunables and possible gotchas around pg_upgrade. During the session we also briefly cover our newly announced Aurora PostgreSQL compatible edition. We will wrap up the session with benchmarking of new RDS instance classes, and the value proposition of these new instance types.
Pilot Hadoop Towards 2500 Nodes and Cluster Redundancy (Stuart Pook)
Hadoop has become a critical part of Criteo's operations. What started out as a proof of concept has turned into two in-house bare-metal clusters of over 2200 nodes. Hadoop contains the data required for billing and, perhaps even more importantly, the data used to create the machine learning models, computed every 6 hours by Hadoop, that participate in real time bidding for online advertising.
Two clusters do not necessarily mean a redundant system, so Criteo must plan for any of the disasters that can destroy a cluster.
This talk describes how Criteo built its second cluster in a new datacenter and how to do it better next time. How a small team is able to run and expand these clusters is explained. More importantly the talk describes how a redundant data and compute solution at this scale must function, what Criteo has already done to create this solution and what remains undone.
Deep dive into the RDS PostgreSQL Universe Austin 2017 (Grant McAlister)
A deep dive into the two RDS PostgreSQL offerings, RDS PostgreSQL and Aurora PostgreSQL, covering what is common between the engines, what is different, and the updates we have made over the past year.
SOS: Optimizing Shuffle I/O with Brian Cho and Ergin Seyfe (Databricks)
Data shuffling is a costly operation. At Facebook, single job shuffles can reach the scale of over 300TB compressed using (relatively cheap) large spinning disks. However, shuffle reads issue large amounts of inefficient, small, random I/O requests to disks and can be a large source of job latency as well as waste of reserved system resources. In order to boost shuffle performance and improve resource efficiency, we have developed Spark-optimized Shuffle (SOS). This shuffle technique effectively converts a large number of small shuffle read requests into fewer large, sequential I/O requests.
In this session, we present SOS’s multi-stage shuffle architecture and implementation. We will also share our production results and future optimizations.
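To make the benefit of fewer, larger reads concrete, here is a minimal sketch of read coalescing. It is not the SOS implementation, which the abstract describes as a multi-stage shuffle architecture; it is a generic illustration, under assumed names (`coalesce_reads`, `max_gap`), of how many small adjacent read requests against one file can be merged into a few sequential ones.

```python
def coalesce_reads(requests, max_gap=0):
    """Merge (offset, length) read requests that touch or nearly touch.

    max_gap is the largest hole, in bytes, that we are willing to read
    through in order to avoid issuing a separate request.
    """
    merged = []
    for offset, length in sorted(requests):
        if merged and offset <= merged[-1][0] + merged[-1][1] + max_gap:
            # Extend the previous request to cover this one as well.
            last_offset, last_length = merged[-1]
            end = max(last_offset + last_length, offset + length)
            merged[-1] = (last_offset, end - last_offset)
        else:
            merged.append((offset, length))
    return merged
```

A larger `max_gap` trades some wasted bytes for fewer seeks, which tends to matter most on spinning disks.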
Gruter TECHDAY 2014 Realtime Processing in Telco (Gruter)
Big Telco, Bigger real-time demands: Real-time processing in Telco
- Presented by Jung-ryong Lee, engineering manager at SK Telecom, at Gruter TECHDAY 2014, Oct. 29, Seoul, Korea
Amazon Aurora with PostgreSQL Compatibility is a relational database service that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source databases. We review the functionality in order to understand the architectural differences that contribute to improved scalability, availability, and durability. We also dive deep into the capabilities of the service and review the latest available features. Finally, we walk through the techniques that can be used to migrate to Amazon Aurora.
Speakers: Kevin O'Dell, Aleksandr Shulman & Kathleen Ting (Cloudera)
From supporting the 0.90.x, 0.92, 0.94, and 0.96 HBase installations on clusters ranging from tens to hundreds of nodes, Cloudera has seen it all. Having automated the upgrade paths from the different Apache releases, we have developed a smooth path that can help the community with upcoming upgrades. In addition to automation best practices, in this talk you'll also learn proactive configuration tweaks and operational best practices to keep your HBase cluster always up and running. We'll also walk through how to contain an application bug let loose in production, how to minimize the impact of faulty hardware on HBase, and the direct correlation between inefficient schema design and HBase performance.
Adobe has packaged HBase in Docker containers and uses Marathon and Mesos to schedule them, allowing us to decouple the RegionServer from the host, express resource requirements declaratively, and open the door for unassisted real-time deployments, elastic (up and down) real-time scalability, and more. In this talk, you'll hear what we've learned and why this approach could fundamentally change HBase operations.
Gary Grider from Los Alamos National Laboratory presented this deck at the 2016 OpenFabrics Workshop.
"Trends in computer memory/storage technology are in flux perhaps more so now than in the last two decades. Economic analysis of HPC storage hierarchies has led to new tiers of storage being added to the next fleet of supercomputers including Burst Buffers or In-System Solid State Storage and Campaign Storage. This talk will cover the background that brought us these new storage tiers and postulate what the economic crystal ball looks like for the coming decade. Further it will suggest methods of leveraging HPC workflow studies to inform the continued evolution of the HPC storage hierarchy."
Watch the video presentation: https://www.youtube.com/watch?v=iDYLIpF-6Ew
See more talks from the Open Fabrics Workshop: http://insidehpc.com/2016-open-fabrics-workshop-video-gallery/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
From docker to kubernetes: running Apache Hadoop in a cloud native way (DataWorks Summit)
Creating containers for an application is easy (even if it’s a good old distributed application like Apache Hadoop): just a few steps of packaging.
The hard part isn't packaging: it's deploying.
How can we run the containers together? How do we configure them? How do the services in the containers find and talk to each other? How do you deploy and manage clusters with hundreds of nodes?
Modern cloud-native tools like Kubernetes or Consul/Nomad can help a lot, but they can be used in different ways.
In this presentation I will demonstrate multiple solutions to manage containerized clusters with different cloud-native tools, including kubernetes and docker-swarm/compose.
No matter which tools you use, the same questions of service discovery and configuration management arise. This talk will show the key elements needed to make that containerized cluster work.
Tools:
kubernetes, docker-swarm, docker-compose, consul, consul-template, nomad
together with: Hadoop, Yarn, Spark, Kafka, Zookeeper, Storm….
References:
https://github.com/flokkr
Speaker
Marton Elek, Lead Software Engineer, Hortonworks
re:Invent 2020 DAT301 Deep Dive on Amazon Aurora with PostgreSQL Compatibility (Grant McAlister)
Amazon Aurora with PostgreSQL compatibility is a relational database managed service that combines the speed and availability of high-end commercial databases with the simplicity and cost-effectiveness of open-source PostgreSQL. This session highlights Aurora with PostgreSQL compatibility’s key capabilities, including low-latency read replicas and Multi-AZ deployments; reviews the architectural enhancements that contribute to Aurora’s improved scalability, availability, and durability; and digs into the latest feature releases. Finally, this session walks through techniques to migrate to Aurora.
At Twitter we started out with a large monolithic cluster that served most of the use-cases. As the usage expanded and the cluster grew accordingly, we realized we needed to split the cluster by access pattern. This allows us to tune the access policy, SLA, and configuration for each cluster. We will explain our various use-cases, their performance requirements, and operational considerations and how those are served by the corresponding clusters. We will discuss what our baseline Hadoop node looks like. Various, sometimes competing, considerations such as storage size, disk IO, CPU throughput, fewer fast cores versus many slower cores, 1GE bonded network interfaces versus a single 10 GE card, 1T, 2T or 3T disk drives, and power draw all need to be considered in a trade-off where cost and performance are major factors. We will show how we have arrived at quite different hardware platforms at Twitter, not only saving money, but also increasing performance.
Hoodie (Hadoop Upsert Delete and Incremental) is an analytical, scan-optimized data storage abstraction that enables applying mutations to data in HDFS on the order of a few minutes, as well as chaining of incremental processing in Hadoop.
AWS June Webinar Series - Getting Started: Amazon Redshift (Amazon Web Services)
Amazon Redshift is a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. In this presentation, you'll get an overview of Amazon Redshift, including how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. Learn how, with just a few clicks in the AWS Management Console, you can set up a fully functional data warehouse that is ready to accept data, without learning any new languages, and easily plug in the existing business intelligence tools and applications you use today. This webinar is ideal for anyone looking to gain deeper insight into their data, without the usual challenges of time, cost and effort.
In this webinar, you will:
- Understand what Amazon Redshift is and how it works
- Create a data warehouse interactively through the AWS Management Console
- Load some data into your new Amazon Redshift data warehouse from S3
Who Should Attend
- IT professionals, developers, line-of-business managers
NYC Hadoop Meetup - MapR, Architecture, Philosophy and Applications (Jason Shao)
Slides from: http://www.meetup.com/Hadoop-NYC/events/34411232/
There are a number of assumptions that come with using standard Hadoop that are based on Hadoop's initial architecture. Many of these assumptions can be relaxed with more advanced architectures such as those provided by MapR. These changes in assumptions have ripple effects throughout the system architecture. This is significant because many systems like Mahout provide multiple implementations of various algorithms with very different performance and scaling implications.
I will describe several case studies and use these examples to show how these changes can simplify systems or, in some cases, make certain classes of programs run an order of magnitude faster.
About the speaker: Ted Dunning - Chief Application Architect (MapR)
Ted has held Chief Scientist positions at Veoh Networks, ID Analytics and at MusicMatch (now Yahoo Music). Ted is responsible for building the most advanced identity theft detection system on the planet, as well as one of the largest peer-assisted video distribution systems and ground-breaking music and video recommendation systems. Ted has 15 issued and 15 pending patents and contributes to several Apache open source projects including Hadoop, Zookeeper and HBase. He is also a committer for Apache Mahout. Ted earned a BS degree in electrical engineering from the University of Colorado; an MS degree in computer science from New Mexico State University; and a Ph.D. in computing science from Sheffield University in the United Kingdom. Ted also bought the drinks at one of the very first Hadoop User Group meetings.
With Hadoop-3.0.0-alpha2 being released in January 2017, it's time to have a closer look at the features and fixes of Hadoop 3.0.
We will have a look at Core Hadoop, HDFS and YARN, and answer the emerging question of whether Hadoop 3.0 will be an architectural revolution, as Hadoop 2 was with YARN & Co., or more of an evolution that adapts to new use cases like IoT, Machine Learning and Deep Learning (TensorFlow).
In this presentation, you will get a look under the covers of Amazon Redshift, a fast, fully-managed, petabyte-scale data warehouse service for less than $1,000 per TB per year. Learn how Amazon Redshift uses columnar technology, optimized hardware, and massively parallel processing to deliver fast query performance on data sets ranging in size from hundreds of gigabytes to a petabyte or more. We'll also walk through techniques for optimizing performance, and you’ll hear from a customer about how their use case takes advantage of fast performance on enormous datasets, leveraging economies of scale on the AWS platform.
Speakers:
Ian Meyers, AWS Solutions Architect
Toby Moore, Chief Technology Officer, Space Ape
We discuss the current state of LLAP (Live Long and Process), the engine for concurrent, sub-second execution of analytical queries in Hive 2.0. LLAP is a hybrid execution model that enables performance improvement in and across queries, such as caching of columnar data with cache coherence and intelligent eviction for disaggregated storage models (like S3, Isilon, Azure), JIT-friendly operator pipelines, asynchronous I/O, data pre-fetching and multi-threaded processing. LLAP features robust machine and service failure tolerance achieved by building on top of the time-tested fault tolerant subsystems, as well as a concurrency-directed design that achieves high utilization with low latency via resource sharing, reducing overheads for multiple queries, and enabling the system to preempt tasks of lower priority without failing any query in-flight. The talk also aims to cover the novel deployment model required for hybrid execution. The elasticity demands of the system are served by a long-lived YARN service interacting with on-demand elastic containers serving as a tightly integrated DAG-based framework for query execution. We discuss the current state of the project, performance numbers, deployment and usage strategy, as well as future work, including how LLAP fits into a unified secure DataFrame access layer.
Updated version of my talk about Hadoop 3.0 with the newest community updates.
Talk given at the codecentric Meetup Berlin on 31.08.2017 and at the Data2Day Meetup in Heidelberg on 28.09.2017.
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We finished with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Transcript: Selling digital books in 2024: Insights from industry leaders - T...BookNet Canada
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
10. What is Hadoop?
Hadoop is an open-source framework for
large-scale data storage and processing.
11. Why Hadoop?
• Traditional data processing was done on large single systems.
• Whenever better performance was needed, the old computer was
replaced with a more powerful one.
• Scaling up was expensive.
• Scaling was also limited to the maximum resources available in a
single system.
12. How does Hadoop Scale?
• “Scale out” rather than “scale up”.
• If the data set or the processing requirement grows, add one more
server.
• Eliminates the strategy of growing computing capacity by throwing
more expensive hardware at the problem.
15. HDFS
Distributed: Data volumes are growing faster than the capacity of a
single storage disk, so a cluster of disks distributed over a network is
necessary.
Scalable: Extends to handle growing data requirements.
Fault-tolerant: Uses replication to protect against the higher
probability of failure that comes with a large number of disks.
16. HDFS
Diagram: six servers (Server 1 to Server 6) with 1 TB of storage each.
Total capacity: 6 TB
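The capacity arithmetic on this slide can be sketched in a few lines of Python. The replication factor of 3 is an assumption taken from the three replicas (R1 to R3) that appear on later slides, the function name is hypothetical, and the calculation ignores filesystem overhead and reserved space.

```python
def cluster_capacity_tb(servers: int, tb_per_server: float, replication: int = 1) -> float:
    """Return the capacity in TB when every block is stored `replication` times."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    raw_tb = servers * tb_per_server
    return raw_tb / replication


# Raw capacity from the slide: six servers with 1 TB each.
raw_tb = cluster_capacity_tb(6, 1.0)
# Usable capacity under the assumed replication factor of 3.
usable_tb = cluster_capacity_tb(6, 1.0, replication=3)
print(raw_tb, usable_tb)
```

Adding a seventh 1 TB server raises the raw capacity to 7 TB, which is the "scale out" step described earlier.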
17. HDFS
Diagram: six servers (Server 1 to Server 6) with 1 TB each, shown alongside File.txt (300 MB).
18. HDFS
Diagram: six servers (Server 1 to Server 6) with 1 TB each. File.txt (300 MB) is split into three 100 MB blocks (F1, F2, F3).
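The split shown on this slide can be sketched as follows. The 100 MB block size is the slide's illustrative value (recent Hadoop releases default to 128 MB blocks, and the size is configurable), and the function name is hypothetical.

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int) -> list[int]:
    """Return the size of each block; only the last block may be smaller."""
    if block_size_mb <= 0:
        raise ValueError("block size must be positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


# File.txt from the slide: 300 MB split into 100 MB blocks.
blocks = split_into_blocks(300, 100)
print(blocks)
```

With a 128 MB block size the same file would occupy two full blocks and one partial block.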
19. HDFS
Diagram: six servers (Server 1 to Server 6) with 1 TB each. File.txt (300 MB) is split into blocks F1, F2 and F3 (100 MB each), and the three blocks are placed on the servers.
20. HDFS
Diagram: six servers (Server 1 to Server 6) with 1 TB each. File.txt (300 MB) is split into blocks F1, F2 and F3 (100 MB each); the first replicas F1-R1, F2-R1 and F3-R1 are stored on the servers.
21. HDFS
Diagram: six servers (Server 1 to Server 6) with 1 TB each. File.txt (300 MB) is split into blocks F1, F2 and F3 (100 MB each); two replicas of each block are stored on the servers (F1-R1, F2-R1, F3-R1, F1-R2, F2-R2, F3-R2).
22. HDFS
Diagram: six servers (Server 1 to Server 6) with 1 TB each. File.txt (300 MB) is split into blocks F1, F2 and F3 (100 MB each); three replicas of each block are stored on the servers (F1-R1, F2-R1, F3-R1, F1-R2, F2-R2, F3-R2, F1-R3, F2-R3, F3-R3).
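One way to picture the replica layout on this slide is a simple round-robin placement. This is a minimal sketch under stated assumptions, not HDFS's actual placement policy (which also takes rack topology into account); the function name and the round-robin rule are hypothetical.

```python
from itertools import cycle


def place_replicas(blocks, servers, replication=3):
    """Assign each block's replicas to distinct servers in round-robin order."""
    if replication > len(servers):
        raise ValueError("need at least as many servers as replicas")
    placement = {server: [] for server in servers}
    ring = cycle(servers)
    for block in blocks:
        # Consecutive picks from the ring are distinct because
        # replication <= len(servers).
        for replica in range(1, replication + 1):
            placement[next(ring)].append(f"{block}-R{replica}")
    return placement


servers = [f"Server {i}" for i in range(1, 7)]
placement = place_replicas(["F1", "F2", "F3"], servers)
for server, replicas in placement.items():
    print(server, replicas)
```

The property that matters is that no server holds two replicas of the same block, so losing one server never removes more than one copy of any block.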
23. HDFS
Diagram: same layout as the previous slide, with six 1 TB servers holding all nine replicas (F1-R1 to F3-R3) of File.txt (300 MB).
24. HDFS
Diagram: six servers (Server 1 to Server 6) with 1 TB each, holding the nine replicas F1-R1 to F3-R3 of File.txt (300 MB), plus one additional copy labelled F3-R2.
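The additional F3-R2 copy on this slide suggests a replica being re-created after a loss; that reading is an assumption, since the transcript carries no caption. Under that assumption, recovery can be sketched as below. The function name, the block locations and the choice of the failed server are all hypothetical.

```python
def re_replicate(block_locations, failed, live_servers, replication=3):
    """Drop the failed server and restore each block to `replication` copies."""
    repaired = {}
    for block, holders in block_locations.items():
        survivors = [s for s in holders if s != failed]
        # Only servers that are alive and do not already hold the block qualify.
        candidates = [s for s in live_servers if s != failed and s not in survivors]
        missing = max(replication - len(survivors), 0)
        repaired[block] = survivors + candidates[:missing]
    return repaired


# Hypothetical block locations before the failure.
locations = {
    "F1": ["Server 1", "Server 2", "Server 3"],
    "F2": ["Server 4", "Server 5", "Server 6"],
    "F3": ["Server 1", "Server 2", "Server 3"],
}
servers = [f"Server {i}" for i in range(1, 7)]
after = re_replicate(locations, failed="Server 2", live_servers=servers)
print(after)
```

Blocks that did not live on the failed server are left untouched; only under-replicated blocks gain a new copy.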
25. Map Reduce
Framework for writing applications that process large amounts of
structured and unstructured data in parallel, across a cluster of
thousands of machines, in a reliable and fault-tolerant manner.
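The programming model behind this definition can be illustrated with a word count, the customary first example. The sketch below runs in a single Python process and does not use any Hadoop API; the function names are hypothetical and only mirror the map, shuffle and reduce phases.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)
```

In a real cluster each call to the map function would run on a different input split, and the grouped keys would be partitioned across several reducers.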
29. Map Reduce
Diagram: four items labelled "File.txt, 75 MB" shown side by side, each annotated "1/4 hour to process 75 MB file".
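The timing on this slide can be checked with a small calculation. It assumes the idealised case: four equal 75 MB pieces (300 MB in total), a constant processing rate, and no scheduling or network overhead. The function name is hypothetical.

```python
def wall_clock_hours(total_mb: float, mb_per_hour: float, workers: int) -> float:
    """Idealised wall-clock time when the data is split evenly across workers."""
    if workers < 1:
        raise ValueError("need at least one worker")
    piece_mb = total_mb / workers
    return piece_mb / mb_per_hour


# Rate taken from the slide: 75 MB in 1/4 hour, i.e. 300 MB per hour.
RATE_MB_PER_HOUR = 75 / 0.25

sequential = wall_clock_hours(300, RATE_MB_PER_HOUR, workers=1)
parallel = wall_clock_hours(300, RATE_MB_PER_HOUR, workers=4)
print(sequential, parallel)
```

Real jobs fall short of this linear speed-up because of task start-up, data shuffling and uneven piece sizes.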
30. Map Reduce
Diagram: six servers (Server 1 to Server 6) with 1 TB each, holding the nine replicas F1-R1 to F3-R3.
31. Map Reduce
Diagram: same layout as the previous slide, with six 1 TB servers holding the nine replicas F1-R1 to F3-R3.
32. Map Reduce
Diagram: six servers (Server 1 to Server 6) with 1 TB each, holding the nine replicas F1-R1 to F3-R3, plus the labels P1-R1, P2-R1 and P3-R1.
33. Map Reduce
• Handles tasks in case of server failures
• Distributes tasks evenly
• Tries to run tasks on the same server where the data block resides
34. YARN
Multi-tenancy - YARN allows multiple access engines (either open-source or
proprietary) to use Hadoop as the common standard for batch, interactive and
real-time engines that can simultaneously access the same data set.
Cluster utilization - YARN’s dynamic allocation of cluster resources improves utilization
over the more static Map Reduce rules used in early versions of Hadoop.
Scalability - Data center processing power continues to expand rapidly. YARN’s
Resource Manager focuses exclusively on scheduling and keeps pace as clusters
expand to thousands of nodes managing petabytes of data.
Compatibility - Existing Map Reduce applications developed for Hadoop 1 can run
on YARN without any disruption to existing processes that already work.
36. Hadoop Ecosystem
Pig (scripting): Platform for analyzing large data sets. It consists of a
high-level language (Pig Latin) that is translated to Map Reduce jobs, which cuts
down on hand-written code. Ideal for extract-transform-load (ETL) data pipelines,
research on raw data, and iterative processing of data.
Hive (SQL): Provides data warehouse infrastructure, enabling data
summarization, ad hoc queries and analysis of large data sets. The query
language, HiveQL (HQL), is similar to SQL.
HCatalog (SQL): Table and storage management layer that provides users of
Pig, MapReduce and Hive with a relational view of data in HDFS. Provides REST
APIs so that external systems can access these tables' metadata.
37. Hadoop Ecosystem
Ambari: Provides an open operational framework for provisioning, managing
and monitoring Hadoop clusters.
ZooKeeper: Provides a distributed configuration service, a synchronization service
and a naming registry for distributed systems.
Oozie: Enables Hadoop administrators to build complex data transformations out
of multiple component tasks, giving greater control over complex jobs and
making it easier to schedule repetitions of those jobs.
38. Hadoop Ecosystem
Tez: Leverages the MapReduce paradigm to enable the creation and execution of
more complex Directed Acyclic Graphs (DAGs) of tasks. Tez eliminates unnecessary
tasks, synchronization barriers, and reads from and writes to HDFS, speeding up
data processing across both small-scale/low-latency and
large-scale/high-throughput workloads.
Spark: Fast, general-purpose in-memory processing engine that can use YARN as a
framework for deployment and can read and write data in HDFS.
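The idea of running tasks as a Directed Acyclic Graph, mentioned for Tez above, can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). This sketch only computes a valid execution order and does not use any Tez API; the task names and dependencies are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "load": set(),
    "clean": {"load"},
    "join": {"clean"},
    "aggregate": {"clean"},
    "report": {"join", "aggregate"},
}


def execution_order(graph):
    """Return an order in which every task runs after all of its dependencies."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(dag)
print(order)
```

Here "join" and "aggregate" depend only on "clean", so an engine that understands the graph could run them side by side instead of chaining them as separate jobs.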
39. Hadoop Ecosystem
Sqoop: Tool designed to transfer data between Hadoop and relational database
servers.
HBase (NoSQL): Non-relational database that provides random real-time access
to data in very large tables. HBase provides transactional capabilities to Hadoop,
allowing users to conduct updates, inserts and deletes.
Flume: Distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of streaming data into HDFS.