This document discusses stream computing from an engineer's perspective. It begins by contrasting batch and stream processing, noting that stream processing handles data one record at a time with an emphasis on latency over throughput. The document then explores how to achieve scalability, performance, durability and availability in stream processing systems. It notes the tradeoffs between these goals and discusses challenges like handling failures. Specific open-source stream processing systems like Storm, Flink and Apex are then analyzed in terms of how they work, strengths, weaknesses and failure handling. The document concludes by discussing using distributed databases for state management in stream processing applications.
Your Guide to Streaming - The Engineer's Perspective
Ilya Ganelin
It feels like every week there's a new open-source streaming platform out there. Yet if you only look at the descriptions, performance metrics, or even the architectures, they all start to look exactly the same. Nothing really differentiates itself, whether it be Storm, Flink, Apex, GearPump, Samza, KafkaStreams, AkkaStreams, or any of the other myriad technologies. So if they all look the same, how do you pick a streaming platform to solve the problem that YOU have? This talk is about how to really compare these platforms. It turns out that they do have key differences; they're just not the ones you usually think about. If you're building a well-engineered system that is meant to last, the way to compare them is to look at how they handle durability and availability, how easy they are to install and use, and how they deal with failures.
2. Batch vs. Stream
• Batch
• Process chunks of data instead of one record at a time
• Throughput over latency (seconds, minutes, hours)
• E.g. MapReduce, Spark, Tez
• Stream
• Data processed one at a time
• Latency over throughput (microseconds, milliseconds)
• E.g. Storm, Flink, Apex, KafkaStreams, GearPump
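A minimal sketch of the difference, assuming a toy per-record function `double` (hypothetical, for illustration only). The batch version returns nothing until the whole chunk is done; the stream version hands back each result as soon as its record arrives.

```python
from typing import Callable, Iterable, Iterator, List


def process_batch(records: List[int], fn: Callable[[int], int]) -> List[int]:
    """Batch: wait for the whole chunk, then transform it in one pass."""
    return [fn(r) for r in records]


def process_stream(records: Iterable[int], fn: Callable[[int], int]) -> Iterator[int]:
    """Stream: emit each result as soon as its record has been processed."""
    for r in records:
        yield fn(r)


def double(x: int) -> int:
    return 2 * x


batch_out = process_batch([1, 2, 3], double)        # visible only once all records are done
stream_out = process_stream(iter([1, 2, 3]), double)
first = next(stream_out)                             # first result is available right away
```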
3. Scalability, Performance, Durability, Availability
• How do we handle more data?
• Quickly?
• Without ever losing data or compute?
• And ensure the system keeps working, even if there are failures?
5. What are the tradeoffs?
• If we focus on scalability, it’s harder to guarantee
• Durability – more moving pieces, more coordination, more failures
• Availability – more failures, harder to stay operational
• Performance – bottlenecks and synchronization
• If we focus on availability, it’s harder to guarantee
• Performance – monitoring and synchronization overhead
• Scalability and performance
• Durability – must recover without losing data
• If we focus on durability, it’s harder to guarantee
• Performance
• Scalability
6. Batch compute has it easy.
• Get scale-out and performance by adding hardware and taking longer
• Get durability with a durable data store and recompute
• Get availability by taking longer to recover (this makes life easier!)
• In stream processing, you don’t have time!
7. It’s not about performance and scale.
• Most platforms handle large volumes of data relatively quickly
• It’s about:
• Ease of use – how quickly can I build a complex application? Not word count.
• Failure-handling – what happens when things break?
• Durability – how do I avoid losing data without sacrificing performance?
• Availability – how can I keep my system operational with a minimum of labor
and without sacrificing performance?
16. Where do the weaknesses come from?
• Nimbus was a single point of failure (fixed as of 1.0.0 release)
• Upstream bolt/spout failure triggers re-compute on entire tree
• Can only create parallel independent streams by running separate, redundant
topologies
• Bolts/spouts share a JVM, which makes them hard to debug
• Failed tuples cannot be replayed quicker than 1s (lower limit on Ack)
• No dynamic topologies
• Cannot add or remove applications without service interruption
• Poor resource sharing in large clusters
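To see why a failure anywhere forces a re-compute of the entire tree, it helps to look at how Storm tracks tuples: the acker keeps one XOR value per spout tuple, not a record of each downstream tuple. The class below is a toy model of that idea, not Storm's implementation; `AckerSketch` and its method names are hypothetical, and real tuple ids are random 64-bit values rather than the fixed ones used here.

```python
class AckerSketch:
    """Toy model: XOR every tuple id in when it is emitted and again when it
    is acked. The tree for a spout tuple is complete when the value is 0."""

    def __init__(self) -> None:
        self.ack_val = {}  # spout (root) tuple id -> running XOR

    def emit(self, root_id: str, tuple_id: int) -> None:
        self.ack_val[root_id] = self.ack_val.get(root_id, 0) ^ tuple_id

    def ack(self, root_id: str, tuple_id: int) -> None:
        self.ack_val[root_id] ^= tuple_id

    def is_complete(self, root_id: str) -> bool:
        return self.ack_val.get(root_id) == 0

    def fail(self, root_id: str) -> str:
        """Only the root is known, so the whole tree must be replayed from it."""
        self.ack_val.pop(root_id, None)
        return root_id


acker = AckerSketch()
ids = [0xA1, 0xB2, 0xC3]             # fixed ids for a deterministic example
for t in ids:
    acker.emit("msg-1", t)
for t in ids[:2]:
    acker.ack("msg-1", t)
pending = acker.is_complete("msg-1")  # False: one tuple is still outstanding
acker.ack("msg-1", ids[2])
done = acker.is_complete("msg-1")     # True: every emitted id was acked
```

Because the acker holds a single value per spout tuple, it cannot tell which branch failed; the only recovery unit is the root.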
18. Enter the Competition – Apache Flink
• Declarative functional API (like Spark)
• But, true streaming platform (sort of) with support for CEP
• Optimized query execution
• Weaknesses:
• Depends on network micro-batching under the hood!
• Not battle-tested
• Failures still affect the entire topology
22. So what’s different from Storm?
• Flink handles planning and optimization for you
• Abstracts lower level internals
• Clear semantics around windowing (which Storm has lacked)
• Failure handling is lightweight and fast!
• Exactly once processing (given appropriate connectors at start/end)
• Can run Storm topologies
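A rough sketch of why checkpointing plus a replayable source gives exactly-once results for operator state. This is a single-operator simplification, not Flink's distributed snapshot mechanism; `run_with_checkpoints` and its parameters are hypothetical. The key assumption is that the source offset and the operator state are saved together and restored together.

```python
from typing import List, Optional, Tuple


def run_with_checkpoints(source: List[int], checkpoint_every: int,
                         fail_at: Optional[int] = None) -> int:
    """Sum a replayable source, simulating one crash just before the record
    at index `fail_at` is processed."""
    checkpoint: Tuple[int, int] = (0, 0)   # (next offset, state), saved as a unit
    offset, state = checkpoint
    failed = False
    while offset < len(source):
        if fail_at is not None and offset == fail_at and not failed:
            failed = True
            offset, state = checkpoint     # roll back state AND source position
            continue
        state += source[offset]
        offset += 1
        if offset % checkpoint_every == 0:
            checkpoint = (offset, state)
    return state


data = list(range(1, 11))
clean = run_with_checkpoints(data, checkpoint_every=3)
recovered = run_with_checkpoints(data, checkpoint_every=3, fail_at=5)
```

Records between the last checkpoint and the crash are processed twice, but their first effect is discarded with the rolled-back state, so each record counts once in the final result. Extending the guarantee to external systems is what the "appropriate connectors" caveat refers to.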
23. What can’t it do?
• Dynamically update topology
• Dynamically scale
• Recover from errors without stopping the entire DAG
• Allow fine-grained control of how data moves through the system –
locality, data partitioning, routing
• You can do these individually, but not all at once
• The high level API is a curse!
• Run in production (Maybe?)
26. Which are unique?
• Apache Beam (Google’s baby - unifies all the platforms)
• Apache Apex (Robust architecture, scalable, fast, durable)
• IBM InfoSphere Streams (proprietary, expensive, the best)
27. Let’s look at Apex
• Unique provenance
• Built for the business at Yahoo – not a research project
• Built for reliability and strict processing semantics, not performance
• Apex just works
• Strengths
• Dynamism
• Scalability
• Failure-handling
• Weaknesses
• No high-level API
• More complex architecture
33. So it’s the best? Sort of!
• Most robust failure-handling
• Allows fine-tuning of data flows and DAG setup
• Excellent exploratory UI
• But
• Learning curve
• No high-level API
• No machine learning support
• Built for business, not for simplicity
34. Streaming is great – what about state?
• What if I need to persist data?
• Across operators?
• Retrieve it quickly?
• Do complex analytics?
• And build models?
35. Why state?
• Historical features (e.g. spend amount over 30 days)
• Statistical aggregates
• Machine learning model training
• Why cross-operator? Because data is partitioned across operators, shared
state allows aggregation over multiple fields.
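As an illustration of a historical feature, here is a minimal sketch of a per-customer rolling spend total. `RollingSpend` is a hypothetical name, and the sketch assumes events for each customer arrive in day order and that days are integers.

```python
from collections import defaultdict, deque


class RollingSpend:
    """Keeps a running spend total per customer over the last `window_days` days."""

    def __init__(self, window_days: int = 30) -> None:
        self.window = window_days
        self.events = defaultdict(deque)    # customer -> deque of (day, amount)
        self.totals = defaultdict(float)    # customer -> current window total

    def update(self, customer: str, day: int, amount: float) -> float:
        q = self.events[customer]
        q.append((day, amount))
        self.totals[customer] += amount
        # Evict purchases that have fallen out of the window.
        while q and q[0][0] <= day - self.window:
            _, old_amount = q.popleft()
            self.totals[customer] -= old_amount
        return self.totals[customer]


feature = RollingSpend(window_days=30)
t1 = feature.update("alice", day=1, amount=20.0)
t2 = feature.update("alice", day=10, amount=5.0)
t3 = feature.update("alice", day=31, amount=10.0)   # the day-1 purchase is evicted
```

The per-customer deques are exactly the state that has to survive a failure, which is what motivates the state stores discussed next.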
36. Distributed In-Memory Databases
• Can support low-latency streaming use cases
• Durability becomes complicated because memory is volatile
• Memory is expensive and limited
• Examples: Memcached, Redis, MemSQL, Ignite, Hazelcast, Distributed
Hash Tables
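One common way to make an in-memory store durable is to append every write to a log on disk before applying it in memory, then replay the log on restart. The sketch below illustrates that idea only; `LoggedStore` is hypothetical and does not reflect how any of the systems listed above implement persistence.

```python
import json
import os
import tempfile


class LoggedStore:
    """In-memory key-value store that replays an append-only log on startup."""

    def __init__(self, log_path: str) -> None:
        self.log_path = log_path
        self.data = {}
        if os.path.exists(log_path):
            with open(log_path) as f:
                for line in f:
                    key, value = json.loads(line)
                    self.data[key] = value

    def put(self, key: str, value) -> None:
        # Write to disk first so an acknowledged write survives a crash.
        with open(self.log_path, "a") as f:
            f.write(json.dumps([key, value]) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.data[key] = value

    def get(self, key: str, default=None):
        return self.data.get(key, default)


path = os.path.join(tempfile.mkdtemp(), "store.log")
store = LoggedStore(path)
store.put("alice", 15.0)
store.put("alice", 22.5)
restarted = LoggedStore(path)          # simulates a restart: memory is rebuilt from the log
```

The synchronous `fsync` on every write is the cost: it trades latency for durability, which is the tension this slide points at.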
38. Lab!
• Build and deploy a simple architecture on a streaming platform
• Ingest data
• Engineer features
• Build a model
• Score against the model
• Storm + H2O
• Model build and model score are two different steps
• H2O allows you to export your model as a POJO that can be added as Java
code in a Storm Bolt
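The build/score split can be sketched without H2O or Storm. Below, a hand-coded least-squares fit stands in for the model build, a plain function stands in for the exported POJO, and `ScoringBolt` is a stand-in for a bolt. All names are hypothetical and none of this is the H2O or Storm API.

```python
from typing import Callable, Dict, List, Tuple


def build_model(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Offline step: ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b


def export_scorer(model: Tuple[float, float]) -> Callable[[float], float]:
    """Freeze the fitted coefficients into a standalone scoring function."""
    a, b = model

    def score(x: float) -> float:
        return a + b * x

    return score


class ScoringBolt:
    """Stand-in for a bolt: holds the exported scorer and applies it per tuple."""

    def __init__(self, scorer: Callable[[float], float]) -> None:
        self.scorer = scorer

    def execute(self, tup: Dict[str, float]) -> Dict[str, float]:
        return {**tup, "score": self.scorer(tup["x"])}


model = build_model([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])   # build, offline
bolt = ScoringBolt(export_scorer(model))                          # score, in the stream
result = bolt.execute({"x": 4.0})
```

The bolt never sees training data; it only needs the exported scorer, which is why the model can be rebuilt offline and swapped in separately.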
39. Goals
• Demonstrate parallel feature computation
• Demonstrate model creation and export using H2O
• Given a labeled data-set (e.g. Titanic) generate a set of scores from
running the model within the Storm topology
• Validate the generated results against a validation dataset (Storm or
offline)
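For the validation goal, a simple offline check is to threshold the scores and compare them with the held-out labels. A minimal sketch, assuming binary 0/1 labels and a 0.5 threshold (both assumptions; the numbers below are made up for illustration, not real results):

```python
from typing import List


def accuracy(scores: List[float], labels: List[int], threshold: float = 0.5) -> float:
    """Fraction of records whose thresholded score matches the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one record to validate")
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


example_scores = [0.9, 0.2, 0.6, 0.4]   # illustrative values only
example_labels = [1, 0, 0, 0]
acc = accuracy(example_scores, example_labels)
```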
40. Plan of attack
• Step 0:
• Storm topology, executing a model (could be linear regression you coded
yourself), locally on a single node.
• Step 1:
• Storm topology, executing an H2O model locally on a single node
• Step 2:
• Storm topology, executing an H2O model, on multiple nodes (real or virtual)
• Step 3 (Extra credit):
• Install Redis as a state store and use a Redis client to access Redis from Storm
41. Final Deliverable
• A report detailing your experience working with this technology
• What worked?
• What did not work?
• What was setup and usability like?
• What issues did you run into?
• How did you resolve these issues?
• Were you able to get the system operational?
• Were you able to get the results you wanted?