This document summarizes the journey toward building a real-time data platform capable of handling 2.5 million events per second. It describes migrating Spark processing from on-premises CDH to AWS EMR to improve scalability; adding fault tolerance through batch sourcing in Spark and auto-recovery; enabling backpressure via Spark Streaming, an HDFS buffer, and pull-based loading into Vertica to avoid overloading downstream systems; and enhancing monitoring with a separate application that tracks pipeline metrics. These architectural changes let the final platform meet its performance goals.
Cowboy dating with big data (TechDays at Lohika, 2020), by b0ris_1.
A story about what happens when data platforms are developed by people who are not data engineers, and the pitfalls and mistakes that can be made along the way. It will help you understand what data engineering is about.
2. ABOUT ME
• Leading DWH @ Oath
• Major expertise: Big Data and Enterprise
• Co-founder of Odessa JUG
• Passionate follower of Scala
• Associate professor at ONPU
5. INTRODUCING DATA PLATFORM [architecture diagram: back office, customer web portal, and mobile apps talk through an API to core domain services built as microservices, connected via integration points]
6. INTRODUCING DATA PLATFORM [same diagram, adding infrastructure services: service discovery, shared config, domain dependency management, ACL management]
7. INTRODUCING DATA PLATFORM [same diagram, asking where BIG DATA fits]
8. INTRODUCING DATA PLATFORM [same diagram, adding the DATA PLATFORM as its own component alongside the core domain services]
11. INTRODUCING DATA PLATFORM [diagram: events flow from 3rd-party providers into the platform components, which feed reporting and analytics]. Major mission: organize data.
12. ZOOMING IN DATA PLATFORM [diagram of the platform modules: ingestion, validation/enrichment, aggregations, analytics, configuration, and a dimension updater, around a warehouse holding raw data, facts, and dimensions that serves the reporting service]
22. WHY WE NEEDED CHANGES
ROCKY SCALING
• Adding or removing nodes in CDH YARN requires a YARN restart and downtime for apps
• Tricky to build quick sandboxes
• The latest MemSQL release (5.x) was not able to operate a cluster of more than 80 nodes
• Max supported rate was 1M events/s, while the business required 2.5M/s
ZERO TOLERANCE
• Faulty EC2 nodes could make Spark or MemSQL get stuck for a while
• Buggy HA: even one faulty node could break the entire MemSQL cluster, forcing us to recreate the database and lose data
• The PUSH approach used to write data to MemSQL
MONITORING & ALERTING
• Find the most relevant metrics
• Eliminate FALSE POSITIVE and FALSE NEGATIVE alerts
25. MIGRATING SPARK TO EMR
EASY CREATE, EASY DESTROY
• easy to … make the bill cost a fortune
MULTIPLE EMR CLUSTERS
• Separation of concerns and isolation
• Better to run a single application per EMR cluster
• Simplified auto-scaling rules
STATELESS EMR CLUSTERS
• Do not use local HDFS
26. CAUTION, EMR!
EASY TO ALLOCATE, AND EASY TO LOSE, AN EMR NODE
• Mostly concerns m4.4xl, the most popular instance type
LOSING THE MASTER NODE MEANS LOSING THE ENTIRE CLUSTER
• Hard to build a reliable platform across multiple AZs [see the fleets model]
• Develop a one-step evacuation procedure to another EMR cluster
LACK OF LUCK WITH A SPECIFIC INSTANCE TYPE (EC2 capacity shortages)
• Can be mitigated by the fleets model
33. CDH vs EMR
• Scaling: CDH cannot scale out/in on demand; EMR can.
• Cost: CDH has no extra cost (community license); EMR adds ~30% on top of EC2 costs, with per-second billing (!).
• YARN: adding machines to CDH requires restarting YARN; EMR needs no YARN restart.
• Configuration: CDH offers easy configuration management via Cloudera Manager; EMR exposes only limited configuration, set at cluster creation.
• Cluster model: CDH is a classic YARN cluster; EMR is ordinary YARN under the hood but imposes an EMR-driven way to deploy apps.
• Topology: CDH runs as a single cluster per AZ; EMR treats an on-demand cluster as the unit of clustering.
35. MAKING SPARK WRITE FASTER
USING A CUSTOM HADOOP COMMITTER
• FileOutputCommitter with the V2 algorithm option, to exclude file moving in HDFS/S3
WRITE THE DATAFRAME TO HDFS FIRST
• Spark writes directly into the partitioned folder in HDFS and registers the new partition in Hive (see the sketch below)
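
A minimal sketch of both ideas, assuming a Hive table `raw_facts` partitioned by `batch_id`; the table name, path, and app name are illustrative, not from the deck:

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}

val spark = SparkSession.builder()
  .appName("fast-hdfs-writer")
  // V2 commit algorithm: tasks promote output to the destination at task
  // commit, skipping the second rename/move pass at job commit
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .enableHiveSupport()
  .getOrCreate()

def writeBatch(batchId: Long, df: DataFrame): Unit = {
  val path = s"hdfs:///warehouse/raw_facts/batch_id=$batchId"
  // write ORC straight into the partition directory...
  df.write.mode(SaveMode.Overwrite).orc(path)
  // ...then register the partition in the Hive metastore so Presto sees it
  spark.sql(
    s"ALTER TABLE raw_facts ADD IF NOT EXISTS PARTITION (batch_id=$batchId) " +
      s"LOCATION '$path'")
}
```

The V2 committer gives up the atomic job-commit rename for speed, which fits a pipeline like this one: a failed batch can simply be rewritten under the same batch_id.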
36. WRITING FASTER – FILE FORMATS
MOST STABLE PERFORMANCE ON UNCOMPRESSED ORC
• Spark apps write raw data in ORC
• Presto reads ORC and writes aggregations in ORC
• replication uses ORC to send deltas to Vertica (write sketch below)
BEST PERFORMANCE WITH HDFS BLOCK SIZE AND STRIPE SIZE OF 64M
• affordable thanks to the strict 6-hour retention policy
ENABLING hive.orc.use-column-names=true
• simplifies the Spark app, allowing it to write the dataframe as is; Presto accesses columns by name
• allows the dataframe schema and the database schema to evolve/be modified independently
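
As a sketch of the write side under these choices; the `orc.stripe.size` option key, the source table, and the output path are assumptions, not from the deck (Spark forwards such options to the ORC writer):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("orc-writer").enableHiveSupport().getOrCreate()
val df = spark.table("raw_facts_staging") // hypothetical source table

df.write
  .format("orc")
  .option("compression", "none")                            // uncompressed was most stable
  .option("orc.stripe.size", (64L * 1024 * 1024).toString)  // assumed writer key, 64 MB stripes
  .save("hdfs:///warehouse/raw_facts_orc")
```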
38. SPARK PERFORMANCE
ONE EXECUTOR PER YARN NODE
• for better CPU and cache utilization, using 16 vcores (aligned to m4.4xl)
ALIGN RDD PARTITIONS TO VCORES
• Repartition the data read from Kafka [addresses skew across Kafka partitions]
SPLIT THE PROCESSING BATCH INTERVAL INTO RESPONSIBILITY ZONES
• Control each interval separately (see the sketch below)
[timeline of a 1-minute batch: fetch from Kafka 8 seconds, enrichment 20 seconds, write to Hive 20 seconds, stuff/overhead 12 seconds]
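
A sketch of the sizing and partition alignment with the DStream Kafka 0-10 API; the topic, group id, broker address, and default core counts are placeholders:

```scala
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

val conf = new SparkConf().setAppName("pipeline")
val ssc = new StreamingContext(conf, Seconds(60)) // 1-minute batch interval

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "kafka:9092", // placeholder address
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "pipeline")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("events"), kafkaParams))

// One fat executor per YARN node with 16 vcores (m4.4xl); repartition to the
// total vcore count so every core gets work even when Kafka partitions are skewed.
val totalCores =
  conf.getInt("spark.executor.instances", 1) * conf.getInt("spark.executor.cores", 16)
val balanced = stream.map(_.value).transform(_.repartition(totalCores))
```

Submitted with something like `--num-executors <nodes> --executor-cores 16`, the repartition costs a shuffle but evens out skew before the expensive enrichment step.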
42. UNDER THE HOOD
• Aggregations and replications run every minute
• Presto uses dimensions hosted outside: MemSQL with real-time updates
[diagram: Spark on EMR writes to HDFS on a collocated HDFS/Presto cluster; replicators driven by a Jenkins scheduler copy data to a Vertica node serving the reporting service; MemSQL holds the dimensions]
44. FAULT TOLERANCE
EMR FLEETS MODEL
• New feature
• Allows focusing on cores instead of machines
• Allows provisioning nodes across multiple AZs
SPARK SPECULATION & BLACKLISTING
• Faulty nodes are a total disaster (c)
• Spark feature request to introduce a minimal speculation interval (conflicts with the DirectCommitter)
45–50. FAULT TOLERANCE
EVENT/BATCH SOURCING
• Spark associates each micro-batch with a batch_id [timestamp]
• batch_id is a partitioned Hive column
• Aggregating and replicating only the missed batches (see the sketch after the diagram)
• In case of failures, after a restart every component shall auto-recover without data losses
[diagram, built up across slides 45–50: Spark lands batches 1, 2, 3 in the raw fact table in HDFS/Hive; Presto aggregates each batch into the aggregation table; the replicator copies aggregated batches into Vertica; after a failure each stage catches up on the batch_ids it is missing]
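
A sketch of what the catch-up step could look like, with hypothetical table names matching the diagram (`raw_facts`, `aggregated_facts`): each stage diffs its input partitions against the downstream table, so a restart naturally resumes where it left off.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catch-up").enableHiveSupport().getOrCreate()

// batch_id is a partition column on both tables, so this diff is a cheap scan.
val missedBatches = spark.sql(
  """SELECT DISTINCT r.batch_id
    |FROM raw_facts r
    |LEFT ANTI JOIN aggregated_facts a ON r.batch_id = a.batch_id""".stripMargin)

missedBatches.collect().map(_.getLong(0)).sorted.foreach { id =>
  // aggregate and replicate just this batch (in the deck the aggregation step
  // itself runs in Presto; a placeholder here)
  println(s"re-processing batch_id=$id")
}
```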
54. BACKPRESSURE ENABLED
[pipeline: NGINX → KAFKA → SPARK → PRESTO → VERTICA → REPORTING SERVICE]
SPARK STREAMING BACKPRESSURE
• a MUST HAVE for a variable rate (settings sketched below)
• FEATURE contributed to the Spark master branch: a backpressure initial max rate for direct mode
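
In configuration terms this amounts to something like the following; the numeric values are placeholders, not the production settings:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // let the rate controller adapt ingestion to actual processing speed
  .set("spark.streaming.backpressure.enabled", "true")
  // cap the very first batches, before the rate estimator has any feedback
  .set("spark.streaming.backpressure.initialRate", "500000")
  // hard per-partition ceiling for the Kafka direct stream, events/sec
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
```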
55. BACKPRESSURE ENABLED
[pipeline: NGINX → KAFKA → SPARK → PRESTO → VERTICA → REPORTING SERVICE]
HDFS VALVE
• HDFS sits between Spark and Presto as a buffer
• Retention policy: 12h
56. BACKPRESSURE ENABLED
[pipeline: NGINX → KAFKA → SPARK → PRESTO → VERTICA → REPORTING SERVICE]
PULL, NOT PUSH, WRITES
• Using Vertica's COPY query from HDFS to let Vertica read data at its own rate (see the sketch below)
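
A sketch of the pull-based load over JDBC; the connection details, schema, and path are invented for illustration, and `DIRECT` asks Vertica to load straight into its columnar ROS storage:

```scala
import java.sql.DriverManager

// Vertica pulls the batch from HDFS at its own pace instead of being pushed to.
val conn = DriverManager.getConnection(
  "jdbc:vertica://vertica-node:5433/analytics", "replicator", "secret")
val stmt = conn.createStatement()
stmt.execute(
  """COPY analytics.aggregated_facts
    |FROM 'hdfs:///warehouse/aggregated_facts/batch_id=42/*.orc'
    |ORC DIRECT""".stripMargin)
conn.close()
```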
57. BACKPRESSURE ENABLED
[pipeline: NGINX → KAFKA → SPARK → PRESTO → VERTICA → REPORTING SERVICE]
KAFKA OUTAGES
• Lua (in NGINX) writes events directly to Kafka
• Unsent events are stored locally and sent to S3
• NiFi periodically sends that data back to Kafka
60. MONITORING FUNDAMENTALS
FUNDAMENTAL REAL-TIME METRICS
• IN RATE
• OUT RATE
• CURRENT LAG (sketch below)
• ERROR RATE
• BATCH PROCESSING TIME
• PIPELINE LATENCY
A SEPARATE APP INTRODUCED [aka BANDARLOG]
• Tracks offsets for Kafka, Hive/Presto, and Vertica
• Standalone application
• To be open-sourced soon
USING DATADOG
• Dashboards, monitors
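
As an illustration of the "current lag" metric for the Kafka leg (Bandarlog itself also tracks Hive/Presto and Vertica offsets), a sketch with Kafka's AdminClient; the group id and broker address are placeholders:

```scala
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.admin.{AdminClient, OffsetSpec}

val props = new Properties()
props.put("bootstrap.servers", "kafka:9092")
val admin = AdminClient.create(props)

// offsets the consumer group has committed, per partition
val committed = admin.listConsumerGroupOffsets("pipeline")
  .partitionsToOffsetAndMetadata().get().asScala

// current log-end offsets for the same partitions
val latest = admin.listOffsets(
  committed.keys.map(tp => tp -> OffsetSpec.latest()).toMap.asJava)
  .all().get().asScala

// lag = log-end offset minus committed offset, summed over partitions
val totalLag = committed.map { case (tp, om) => latest(tp).offset() - om.offset() }.sum
println(s"current lag: $totalLag")
admin.close()
```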
64. WHAT WE HAVE ACHIEVED
SCALABLE PRODUCTION
• Ability to grow further beyond 1M/s, up to 2.5M/s
STABLE PRODUCTION ENVIRONMENT
• fault-tolerant components, easier to recover
LESS EXPENSIVE
• Smaller Spark cluster (-50%)
• The Presto cluster is ~30% smaller than the MemSQL-driven one
SIMPLIFIED MAINTENANCE
• Auto recovery and scaling
• No wake-ups overnight