The need to handle increasingly large volumes of data, to drive decisions quickly (via streaming technologies and machine learning algorithms), to scale systems effectively, to guarantee the right level of continuity, and to move data across systems efficiently has become a critical and challenging set of requirements. In this talk we demonstrate how to design reactive, resilient, message-driven and elastic applications by combining technologies such as Akka, Kafka, Cassandra and Spark with architectural patterns like CQRS and Event Sourcing (ES) in order to meet these needs.
UKOUG 2011: MySQL Architectures for Oracle DBAs - FromDual GmbH
The document discusses MySQL architectures for Oracle DBAs. It begins with an introduction to FromDual, a MySQL consulting company. It then covers topics like the LAMP stack, the history and open source nature of MySQL, different MySQL branches and forks, moving from Oracle to MySQL, MySQL architecture including storage engines, differences between Oracle and MySQL, scale-up vs scale-out approaches, and high availability solutions. Specific MySQL architectures put in place for customers are also presented, covering domains like manufacturing, solar energy, and online trading.
Apache Cassandra is a massively scalable, highly available NoSQL database that provides continuous availability without compromising performance. It handles big data workloads across multiple data centers with no single point of failure and allows for fast, linear scalability as well as elasticity. Cassandra offers tunable data consistency, location independence for reading and writing data anywhere, and a familiar SQL-like language called CQL.
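The tunable-consistency rule behind that summary can be sketched in a few lines of plain Python (illustrative names, not the Cassandra driver API): a read of R replicas and a write of W replicas are guaranteed to overlap in at least one up-to-date replica whenever R + W > N, where N is the replication factor.

```python
# Sketch of Cassandra-style tunable consistency (assumed helper name,
# not part of any real driver): check whether chosen read/write
# consistency levels guarantee that reads see the latest write.

def is_strongly_consistent(n: int, r: int, w: int) -> bool:
    """True if every read quorum intersects every write quorum."""
    return r + w > n

# Common settings for replication factor N = 3:
assert is_strongly_consistent(3, 2, 2)      # QUORUM reads + QUORUM writes
assert not is_strongly_consistent(3, 1, 1)  # ONE + ONE: eventual consistency
```

In CQL terms this corresponds to choosing a consistency level such as QUORUM per statement; the function above is only the arithmetic behind that choice.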
Skalierbarkeit mit MariaDB und MaxScale (Scalability with MariaDB and MaxScale) - MariaDB Roadshow Summer 2014, Hamburg - MariaDB Corporation
Presented by Ralf Gebhardt at the MariaDB Roadshow Germany: 4 July 2014 in Hamburg, 8 July 2014 in Berlin and 11 July 2014 in Frankfurt.
This document provides an overview of infrastructure as a service (IaaS) components and use cases. IaaS allows customers to provision compute, storage, and networking resources on-demand using a web interface or API. Key components include servers, storage, firewalls, load balancers, and networking. Common use cases are hybrid environments, legacy applications, containers, OpenStack, and disaster recovery. Some caveats are capacity and connectivity limitations of hardware, administration and performance of storage, firewall and load balancer capacity, backup definitions, and network design issues.
MySQL High Availability: Managing Farms of Distributed Servers (MySQL Fabric) - Alfranio Júnior
This document provides an overview and introduction to MySQL Fabric, a new high availability and distributed database solution from Oracle. The summary includes:
- MySQL Fabric is a distributed framework that allows farms of MySQL servers to be managed as highly available groups. It uses extensions and connectors to provide fault tolerance.
- Failure detection and failover works by having MySQL Fabric monitor the servers in an availability group. If the master fails, it will trigger a failover to promote a slave to become the new master.
- MySQL Fabric-aware connectors are available for Python, Java, and PHP that can route transactions, cache information, and handle failures by retrying operations on a different server if needed.
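The retry-and-reroute behavior described in the last bullet can be illustrated as follows. This is a hedged sketch in plain Python; the real Fabric-aware connectors do this inside the driver, and the helper below (`run_with_failover`) and its arguments are invented names.

```python
# Sketch: try an operation against candidate servers in order, moving to
# the next server when one is unreachable (as a Fabric-aware connector
# would do after a failover promotes a new master).

def run_with_failover(servers, operation, max_attempts=3):
    """Try `operation` against each candidate server until one succeeds."""
    last_error = None
    for server in servers[:max_attempts]:
        try:
            return operation(server)
        except ConnectionError as error:
            last_error = error   # server unreachable: route to the next one
    raise last_error

attempted = []
def op(server):
    attempted.append(server)
    if server == "old-master":
        raise ConnectionError("master is down")
    return f"ok from {server}"

assert run_with_failover(["old-master", "new-master"], op) == "ok from new-master"
assert attempted == ["old-master", "new-master"]
```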
The document discusses methods for sharding MySQL databases. It begins with an introduction to sharding and the different types of sharding methods. It then provides details on building a large database cluster using the Palomino Cluster Tool, which utilizes configuration management tools like Ansible, Chef and Puppet. The document concludes with a section on administering the large database cluster using the open source tool Jetpants.
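One sharding method such talks commonly cover is consistent hashing. A minimal, self-contained sketch (assuming nothing from the Palomino or Jetpants tooling; `Ring` and `shard_for` are illustrative names) might look like:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    # Stable hash so every client maps keys the same way.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes (illustrative)."""

    def __init__(self, shards, vnodes=64):
        self._points = sorted(
            (_hash(f"{shard}#{i}"), shard)
            for shard in shards for i in range(vnodes)
        )
        self._hashes = [h for h, _ in self._points]

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        i = bisect.bisect(self._hashes, _hash(key)) % len(self._points)
        return self._points[i][1]

ring = Ring(["db1", "db2", "db3"])
assert ring.shard_for("user:42") == ring.shard_for("user:42")  # deterministic
assert ring.shard_for("user:42") in {"db1", "db2", "db3"}
```

The point of the ring (versus hashing modulo the shard count) is that adding or removing one shard remaps only roughly 1/N of the keys.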
As fast as a grid, as safe as a database - Gojko Adzic
From the Gaming Scalability event, June 2009 in London (http://gamingscalability.org).
In this talk, Matthew Fowler from NT/e looks at the persistence issues on computing clouds. He discusses architectural principles and problems that cloud persistence presents to application developers and presents a possible solution, focusing on the key ideas, the tooling and the deployment options.
Matthew Fowler runs the Java business unit of New Technology/enterprise. Matthew received a BSc in Computer Science from MIT. He has developed and marketed products in many areas of software - LANs, WANs, software tools, language processors and generation of enterprise applications. His current interests are system generation and grid/cloud applications.
A Quick Guide to SQL Server Availability Groups - Pio Balistoy
This document provides an overview of SQL Server availability groups. It defines availability groups as a high-availability and disaster recovery feature that allows failover between replicas of a group of databases. Key components include the primary replica that serves read-write transactions, secondary replicas that serve as failover targets, and an availability group listener that directs clients to the current primary replica. The document discusses advantages over previous solutions like database mirroring, as well as some compatibility limitations and administrative considerations.
MySQL Fabric: Easy Management of MySQL Servers - Mats Kindahl
MySQL Fabric is an open-source solution recently released by the MySQL Engineering team at Oracle. It seeks to make horizontal scale-out through sharding more accessible to users with growing data management requirements. This integrated framework supports management of large farms of MySQL servers, and includes support for sharding and high-availability. This is the presentation from Percona Live UK in London and it covers:
- Architecture for performance of a sharded deployment
- Management of MySQL server farms via MySQL Fabric
- MySQL Fabric as a tool for handling sharding and high-availability
- Application demands when working with a sharded database
- Connector demands when working with a sharded database
- Approaches to mixing sharded and global tables
Application Development with Apache Cassandra as a Service - WSO2
WSO2 is an open source software company founded in 2005 that produces an entire middleware platform under the Apache license. Their business model involves selling comprehensive support and maintenance for their products. They have over 150 employees with offices globally. The document discusses using Apache Cassandra as a NoSQL database with WSO2's Column Store Service, including how to install the Cassandra feature, manage keyspaces and column families, and develop applications using Hector, a Java client API for Cassandra.
How Orwell built a geo-distributed Bank-as-a-Service with microservices - MariaDB plc
Orwell Group shares how they leveraged microservices, an event-driven architecture and both master and reference data management methodologies to build a new banking system for retail banking customers and corporate banks requiring cross-border payments and cash flow management – and scaled it to handle customers with millions of clients. In particular they explain how they built a highly available, geo-distributed and consistent platform on top of MariaDB. The result was a secure and distributed platform with high cost efficiency, and the data accuracy and consistency needed to create high-quality data pipelines from transactions to analytics and ensure regulatory compliance (e.g., GDPR).
Migrating from InnoDB and HBase to MyRocks at Facebook - MariaDB plc
Migrating large databases at Facebook from InnoDB to MyRocks and HBase to MyRocks resulted in significant space savings of 2-4x and improved write performance by up to 10x. Various techniques were used for the migrations such as creating new MyRocks instances without downtime, loading data efficiently, testing on shadow instances, and promoting MyRocks instances as masters. Ongoing work involves optimizations like direct I/O, dictionary compression, parallel compaction, and dynamic configuration changes to further improve performance and efficiency.
This document discusses using solid state drives (SSDs) for server-side flash caching to improve performance. It covers SSD form factors for servers, the components of an SSD, deployment models for server-side flash including direct storage and pooled/replicated storage, use cases for server flash caching like databases and virtualization, and considerations for write-through versus write-back caching and live migration support. It also lists several vendors that provide server-side flash caching software.
Making MySQL Administration a Breeze - A look into a MySQL DBA's toolchest - Lenz Grimmer
This document discusses various open source tools for MySQL administration. It describes toolkits like Maatkit and Kontrollkit which provide collections of scripts focused on MySQL. Individual tools covered include innotop for monitoring, mydumper for parallel backups, xtrabackup for online backups, and oak-security-audit for auditing. The document also discusses tools for user account administration, replication monitoring and maintenance, and performance tuning.
Deploying MariaDB databases with containers at Nokia Networks - MariaDB plc
Nokia is focused on providing software and products that facilitate rapid development, deployment and scaling of products and services to customers. The Common Software Foundation (CSF) within Nokia develops and supports product reuse by multiple applications within Nokia, including MariaDB. Their focus over the last year has been to develop a containerized MariaDB solution supporting multiple architectures, including both clustering and primary/secondary replication with MariaDB MaxScale. In this talk, Rick Lane discusses this journey of these containerized solutions from development to customer trials, including problems encountered and solutions.
MySQL Fabric is an open-source framework for the management of farms of servers. It is designed to be easy to use and available for both small and large server farms.
In order to create a solution that is truly resilient to failures, it is necessary to ensure redundancy of every component in the system and have a solid foundation for detecting and handling failures.
In this session, you will learn how to build a robust high-availability solution using MySQL Fabric, what components you need and how they should be set up. You will learn how MySQL Fabric handles high-availability of the application servers and how to ensure high-availability of the Fabric system as a whole. You will also learn how to leverage, for example, OpenStack to ensure that the system keeps operating in the presence of failures.
This document provides an overview and agenda for a presentation on database sharding. It discusses how sharding can help with scaling databases to handle increasing load. It describes the key components of a sharded database solution like shards, switches, and state stores. It also covers important aspects of implementing sharding like transaction handling, mapping sharding keys, and handling queries across sharded tables.
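The shard/switch/state-store split described above can be illustrated with a range-based shard map: the state store records which key ranges live on which shards, and the switch routes each query accordingly. All names here (`RangeShardMap`, `route`) are hypothetical.

```python
import bisect

class RangeShardMap:
    """Toy state store: maps sharding-key ranges to shard names."""

    def __init__(self, boundaries, shards):
        # boundaries[i] is the first key served by shards[i + 1],
        # so len(shards) == len(boundaries) + 1.
        self.boundaries = boundaries
        self.shards = shards

    def route(self, key):
        # The "switch" decision: pick the shard whose range contains key.
        return self.shards[bisect.bisect_right(self.boundaries, key)]

shard_map = RangeShardMap(boundaries=[1000, 2000], shards=["s0", "s1", "s2"])
assert shard_map.route(17) == "s0"
assert shard_map.route(1000) == "s1"   # boundary key belongs to the next shard
assert shard_map.route(2500) == "s2"
```

Queries that touch a single sharding key can be routed this way; queries spanning sharded tables have to be fanned out to every shard and merged, which is exactly the harder case the presentation flags.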
Clustering MySQL is a mainstream technique for handling today's web loads. Regardless of whether you choose MySQL Replication, MySQL Cluster or any other type of clustering solution, you will need a load balancer. PECL/mysqlnd_ms 1.4 is a driver-integrated load balancer for PHP. It works with all APIs, is free, semi-transparent, sits at the best possible layer in your stack and is loaded with features. Get an overview of the latest development version 1.4.
Automated database failover is difficult due to the need for redundancy at both the server and data levels while supporting thousands of writes per second. While attempts have been made to automate failover using tools like DRBD, MySQL replication, and clustering solutions, there are many things that can still go wrong, such as replication not working properly, false failure detections, unreachable failed nodes, and performance issues with cold caches on new masters. Instead of focusing so much on automating failover, it may be better to design systems that are resilient to single failures without requiring failover in the first place, using techniques employed by systems like Amazon Dynamo.
MySQL 5.7 is GA. Here is the news about our NoSQL features in MySQL and MySQL Cluster, with a lot of emphasis on the new JSON features that make MySQL suitable as a document store.
MaxScale is an open-source, highly scalable, and transparent load balancing solution for MySQL and MariaDB databases. It acts as a proxy between applications and databases, authenticating clients, routing queries, and monitoring database nodes. MaxScale supports features like read/write splitting, connection load balancing, and filtering of queries through extensible plugin modules. Typical use cases include balancing read loads across database replicas and distributing connections among nodes in a Galera cluster.
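The read/write-splitting decision at the heart of that router can be sketched as follows. Real MaxScale parses SQL properly and load-balances across all replicas; this keyword check is only an illustration, and `route` is an invented name.

```python
# Sketch of read/write splitting: read-only statements go to a replica,
# everything else (and locking reads) to the primary.

READ_PREFIXES = ("select", "show", "explain")

def route(statement: str, primary: str, replicas: list) -> str:
    first_word = statement.lstrip().split(None, 1)[0].lower()
    if first_word in READ_PREFIXES and "for update" not in statement.lower():
        return replicas[0]   # a real router load-balances across replicas
    return primary

assert route("SELECT * FROM t", "db0", ["db1"]) == "db1"
assert route("INSERT INTO t VALUES (1)", "db0", ["db1"]) == "db0"
assert route("SELECT * FROM t FOR UPDATE", "db0", ["db1"]) == "db0"
```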
This document discusses configuring and implementing a MariaDB Galera cluster for high availability on three Ubuntu servers. It provides steps to install MariaDB with Galera patches, configure the basic Galera settings, and start the cluster across the nodes. Key aspects covered include state transfer methods, Galera architecture, and important status variables for monitoring the cluster.
This is a slide deck that was used for our 11/19/15 Nike Tech Talk to give a detailed overview of the SnappyData technology vision. The slides were presented by Jags Ramnarayan, Co-Founder & CTO of SnappyData.
Lesfurest.com invited me to talk about the Kappa architecture style during a BBL.
Kappa architecture is a style for real-time processing of large volumes of data that combines the stream processing, storage, and serving layers into a single pipeline. It differs from the Lambda architecture, which uses separate batch and stream processing pipelines.
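The single-pipeline idea can be shown in miniature: one processing function serves both the live stream and any "batch" recomputation, which is just a replay of the immutable log through the same code. This is a plain-Python sketch, not tied to any particular streaming engine.

```python
# Kappa in miniature: one code path; batch views are produced by
# replaying the same event log through the same function.

def process(events, state=None):
    """Single processing function used for live and replayed data."""
    state = dict(state or {})
    for user, amount in events:
        state[user] = state.get(user, 0) + amount
    return state

log = [("alice", 3), ("bob", 1), ("alice", 2)]   # the immutable event log

live_view = process(log)     # built incrementally in production
recomputed = process(log)    # full replay, e.g. after a code change
assert live_view == recomputed == {"alice": 5, "bob": 1}
```

In Lambda, by contrast, the `live_view` and `recomputed` sides would be two different codebases (stream and batch) whose results must be reconciled at serving time.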
Spark Streaming Recipes and "Exactly Once" Semantics Revised - Michael Spector
This document discusses stream processing with Apache Spark. It begins with an overview of Spark Streaming and its advantages over other frameworks like low latency and rich APIs. It then covers core Spark Streaming concepts like windowing and achieving "exactly once" semantics through checkpointing and write ahead logs. The document presents two examples of using Spark Streaming for analytics and aggregation with transactional and snapshotted approaches. It concludes with notes on deployment with Mesos/Marathon and performance tuning Spark Streaming jobs.
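The transactional flavor of "exactly once" mentioned above can be sketched without any framework: store the input offset together with the aggregate, commit both atomically, and skip batches whose offset has already been applied. The dict below stands in for a transactional database table; all names are illustrative.

```python
# Effectively-exactly-once via idempotent, offset-tagged commits.

store = {"offset": -1, "total": 0}

def apply_batch(store, offset, values):
    if offset <= store["offset"]:
        return store                  # replayed batch: already applied, skip
    store["total"] += sum(values)
    store["offset"] = offset          # committed atomically with the result
    return store

apply_batch(store, 0, [1, 2])
apply_batch(store, 1, [3])
apply_batch(store, 1, [3])            # redelivery after a failure: no effect
assert store == {"offset": 1, "total": 6}
```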
Demi Ben-Ari is a senior software engineer at Windward Ltd. with a BS in computer science, who previously worked as a software team leader and senior Java engineer developing missile defense and alert systems. The presentation discusses Spark, an open-source cluster computing framework, and how Windward uses Spark for data filtering, management, predictions and more through Java applications running on YARN clusters.
Apache Spark is an open-source unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. Some key components of Apache Spark include Resilient Distributed Datasets (RDDs), DataFrames, Datasets, and Spark SQL for structured data processing. Spark also supports streaming, machine learning via MLlib, and graph processing with GraphX.
Extending Spark Streaming to Support Complex Event Processing - Oh Chan Kwon
In this talk, we introduce extensions of Spark Streaming to support (1) SQL-based query processing and (2) elastic, seamless resource allocation.

First, we explain the methods for supporting window queries and query chains. Last year, Grace Huang and Jerry Shao introduced the concept of "StreamSQL", which can process streaming data with SQL-like queries by adapting Spark SQL to Spark Streaming; we build on their efforts to support complex event processing (CEP). In detail, we implemented the sliding window concept to support time-based streaming data processing at the SQL level. To reduce the aggregation time of large windows, we generate an efficient query plan that computes partial results by evaluating only the data entering or leaving the window, and then obtains the current result by merging the previous result with the partial ones. To support query chains, we made the result of a query over streaming data a table by adding the "insert into" query, which allows stream queries to be applied to the results of other queries.

Second, we explain the methods for allocating resources to streaming applications dynamically, which enable the applications to meet a given deadline. As the rate of incoming events varies over time, the resources allocated to an application need to be adjusted to keep utilization high. However, Spark's current resource allocation features are not suitable for streaming applications: allocated resources are not freed while new data keeps arriving, even when the volume of new data is very small. To resolve this, we monitor resource utilization and, when it is low, choose victim nodes to be killed; we stop feeding new data to the victims so that no useless recovery is triggered when they are killed. Accordingly, we can scale the resources in and out seamlessly.
This document summarizes a presentation on extending Spark Streaming to support complex event processing. It discusses:
1) Motivations for supporting CEP in Spark Streaming, as current Spark is not enough to support continuous query languages or auto-scaling of resources.
2) Proposed solutions including extending Intel's Streaming SQL package, improving windowed aggregation performance, supporting "Insert Into" queries to enable query chains, and implementing elastic resource allocation through auto-scaling in/out of resources.
3) Evaluation of the Streaming SQL extensions showing low processing delays despite heavy loads or large windows, though more memory optimization is needed.
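The windowed-aggregation optimization evaluated above — compute only the data entering or leaving the window instead of re-aggregating everything — can be sketched for a simple sum. This plain-Python `SlidingSum` is an illustration of the idea, not the Streaming SQL implementation.

```python
from collections import deque

class SlidingSum:
    """Incremental sliding-window sum: O(1) per element, not O(window)."""

    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0

    def push(self, value):
        self.window.append(value)
        self.total += value                      # add the entering value
        if len(self.window) > self.size:
            self.total -= self.window.popleft()  # subtract the leaving value
        return self.total

s = SlidingSum(size=3)
assert [s.push(v) for v in [1, 2, 3, 4, 5]] == [1, 3, 6, 9, 12]
```

The same merge-partial-results trick applies to any invertible aggregate (sum, count, mean); non-invertible ones (min, max, distinct) need more bookkeeping, which is where large windows get expensive.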
Real-time Analytics with Apache Kafka and Apache Spark - Rahul Jain
A presentation-cum-workshop on real-time analytics with Apache Kafka and Apache Spark. Apache Kafka is a distributed publish-subscribe messaging system, while Spark Streaming brings Spark's language-integrated API to stream processing, allowing streaming applications to be written quickly and easily. It supports both Java and Scala. In this workshop we explore Apache Kafka, ZooKeeper and Spark with a web click-streaming example using Spark Streaming. A clickstream is the recording of the parts of the screen a computer user clicks on while web browsing.
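What such a clickstream pipeline computes can be shown in miniature, independent of Kafka and Spark: clicks per page per tumbling time window. Field names and timestamps here are invented for illustration.

```python
from collections import Counter, defaultdict

def clicks_per_window(events, window_s=60):
    """Count clicks per page in tumbling windows of `window_s` seconds."""
    windows = defaultdict(Counter)
    for ts, page in events:               # events: (epoch seconds, page URL)
        windows[ts // window_s][page] += 1
    return {w: dict(counts) for w, counts in windows.items()}

events = [(0, "/home"), (30, "/buy"), (61, "/home"), (65, "/home")]
assert clicks_per_window(events) == {0: {"/home": 1, "/buy": 1},
                                     1: {"/home": 2}}
```

In the workshop's architecture, Kafka delivers the `events` stream and Spark Streaming evaluates this kind of aggregation continuously over micro-batches.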
Spark Streaming with Kafka allows processing streaming data from Kafka in real time. There are two main approaches: receiver-based and direct. The receiver-based approach uses Spark receivers to read data from Kafka and writes to write-ahead logs for fault tolerance. The direct approach reads from Kafka without a receiver, tracking offsets itself, which gives better performance and simpler exactly-once semantics. The document discusses using Spark Streaming to aggregate streaming data from Kafka in real time, persisting aggregates to Cassandra and raw data to S3 for analysis. It also covers using stateful transformations to update Cassandra in real time.
Spark Streaming can be used to process streaming data from Kafka in real-time. There are two main approaches - the receiver-based approach where Spark receives data from Kafka receivers, and the direct approach where Spark directly reads data from Kafka. The document discusses using Spark Streaming to process tens of millions of transactions per minute from Kafka for an ad exchange system. It describes architectures where Spark Streaming is used to perform real-time aggregations and update databases, as well as save raw data to object storage for analytics and recovery. Stateful processing with mapWithState transformations is also demonstrated to update Cassandra in real-time.
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
Spark Streaming and Kafka Streams are two popular stream processing platforms. Spark Streaming uses micro-batching and allows for code reuse between batch and streaming jobs. Kafka Streams is embedded directly into Apache Kafka and leverages Kafka as its internal messaging layer. Both platforms support stateful stream processing operations like windowing, aggregations, and joins through distributed state stores. A demo application is shown that detects dangerous driving by joining truck position data with driver data using different streaming techniques.
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv Amazon Web Services
"Low latency analytics is becoming a very popular scenario. In this session we will discuss several architectural options for doing
analytics on moving data using Amazon Kinesis and EMR/Spark Streaming and share some best practices and real world examples."
This document discusses Spark Streaming and its use for near real-time ETL. It provides an overview of Spark Streaming, how it works internally using receivers and workers to process streaming data, and an example use case of building a recommender system to find matches using both batch and streaming data. Key points covered include the streaming execution model, handling data receipt and job scheduling, and potential issues around data loss and (de)serialization.
Reactive app using actor model & apache sparkRahul Kumar
Developing Application with Big Data is really challenging work, scaling, fault tolerance and responsiveness some are the biggest challenge. Realtime bigdata application that have self healing feature is a dream these days. Apache Spark is a fast in-memory data processing system that gives a good backend for realtime application.In this talk I will show how to use reactive platform, Actor model and Apache Spark stack to develop a system that have responsiveness, resiliency, fault tolerance and message driven feature.
This introductory workshop is aimed at data analysts & data engineers new to Apache Spark and exposes them how to analyze big data with Spark SQL and DataFrames.
In this partly instructor-led and self-paced labs, we will cover Spark concepts and you’ll do labs for Spark SQL and DataFrames
in Databricks Community Edition.
Toward the end, you’ll get a glimpse into newly minted Databricks Developer Certification for Apache Spark: what to expect & how to prepare for it.
* Apache Spark Basics & Architecture
* Spark SQL
* DataFrames
* Brief Overview of Databricks Certified Developer for Apache Spark
Unified Big Data Processing with Apache SparkC4Media
Video and slides synchronized, mp3 and slide download available at URL http://bit.ly/1yNuLGF.
Matei Zaharia talks about the latest developments in Spark and shows examples of how it can combine processing algorithms to build rich data pipelines in just a few lines of code. Filmed at qconsf.com.
Matei Zaharia is an assistant professor of computer science at MIT, and CTO of Databricks, the company commercializing Apache Spark.
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
O'Reilly Webcast with Myself and Evan Chan on the new SNACK Stack (playoff of SMACK) with FIloDB: Scala, Spark Streaming, Akka, Cassandra, FiloDB and Kafka.
Headaches and Breakthroughs in Building Continuous ApplicationsDatabricks
At SpotX, we have built and maintained a portfolio of Spark Streaming applications -- all of which process records in the millions per minute. From pure data ingestion, to ETL, to real-time reporting, to live customer-facing products and features, continuous applications are in our DNA. Come along with us as we outline our journey from square one to present in the world of Spark Streaming. We'll detail what we've learned about efficient processing and monitoring, reliability and stability, and long term support of a streaming app. Come learn from our mistakes, and leave with some handy settings and designs you can implement in your own streaming apps.
In this one day workshop, we will introduce Spark at a high level context. Spark is fundamentally different than writing MapReduce jobs so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
4. The picture
Highly demanding environments
- Data is increasing dramatically
- Applications are needed faster than ever
- Customers are more demanding
- Customers are becoming more sophisticated
- Services are becoming more sophisticated and complex
- Performance & quality are becoming a must
- Rate of business change is ever increasing
- And more…
6. We need to embrace change!
Introduction – The world is changing…
7. Introduction - Real Time “Bidding”
High level architecture
Akka
Persistence
Input
Output
Cassandra
Kafka
Training Prediction Scoring
Spark Batch
Real Time
Action
Dispatch
Publish
Store
Journaling
9. Multi-tier stereotypical architecture + CRUD
CQRS
Presentation Tier
Business Logic Tier
Data Tier
Integration Tier
RDBMS
Client Systems
External Systems
DTO/VO
10. Multi-tier stereotypical architecture + CRUD
CQRS
- Pros
- Simplicity
- Tooling
- Cons
- Difficult to scale (the RDBMS is usually the bottleneck)
- Domain-Driven Design not applicable (using CRUD)
11. Think different!
CQRS
- Is there a different architectural model that doesn't rely heavily on:
- CRUD
- RDBMS transactions
- J2EE/Spring technologies stack
12. Command and Query Responsibility Segregation
Originated with Bertrand Meyer’s Command and Query Separation Principle
“It states that every method should either be a command that performs an action, or a query that
returns data to the caller, but not both. In other words, asking a question should not change the
answer. More formally, methods should return a value only if they are referentially transparent
and hence possess no side effects” (Wikipedia)
CQRS
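Meyer's principle can be sketched in a few lines of Scala. This is a minimal illustration (the `Account` class is hypothetical, not from the talk): every method is either a command that mutates state and returns `Unit`, or a query that returns data without side effects.

```scala
// Command–Query Separation on a single object.
class Account {
  private var amount: Long = 0

  // Command: performs an action, returns nothing
  def deposit(value: Long): Unit = { amount += value }

  // Query: returns data; asking the question does not change the answer
  def balance: Long = amount
}

val acc = new Account
acc.deposit(100)
acc.deposit(60)
```

CQRS takes the same split one level up: instead of separating methods on one object, it separates whole services.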
14. Available Services
- The service has been split into:
- Command → Write side service
- Query → Read side service
CQRS
Change status → Status changed
Get status → Status retrieved
15. Main architectural properties
- Consistency
- Command → consistent by definition
- Query → eventually consistent
- Data Storage
- Command → normalized way
- Query → denormalized way
- Scalability
- Command → low transaction rate
- Query → high transaction rate
CQRS
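The split between the two sides can be sketched in plain Scala. This is a hypothetical, in-memory model (names like `WriteSide`, `ReadSide` and `StatusChanged` are illustrative): the write side validates commands and appends events, while the read side maintains a denormalized projection that is updated from those events and may therefore lag behind (eventual consistency).

```scala
sealed trait Event
case class StatusChanged(id: String, status: String) extends Event

object WriteSide {
  private var journal: Vector[Event] = Vector.empty
  // Handle a command: validate it and append the resulting event
  def handle(id: String, newStatus: String): Event = {
    val e = StatusChanged(id, newStatus)
    journal = journal :+ e
    e
  }
  def events: Vector[Event] = journal
}

object ReadSide {
  // Denormalized view, optimized for queries
  private var view: Map[String, String] = Map.empty
  def project(e: Event): Unit = e match {
    case StatusChanged(id, s) => view = view.updated(id, s)
  }
  def status(id: String): Option[String] = view.get(id)
}

val e = WriteSide.handle("order-1", "shipped")
// Until the event is projected, the query side is stale
ReadSide.project(e)
```

In a real system the projection step would happen asynchronously, which is exactly why the query side is only eventually consistent.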
17. Storing Events…
Event Sourcing
Systems today usually rely on
- Storing only the current state
- Using an RDBMS as the storage solution
Architectural choices are often “RDBMS centric”
Many systems need to store all the events that occurred instead of storing only the updated state
18. Commands vs Events
Event Sourcing
- Commands
- Ask to perform an operation (imperative mood)
- Can be rejected
- Events
- Something that happened in the past (past tense)
- Cannot be undone
Command received → Command validation → Event persisted → State mutation
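That pipeline can be sketched in a few lines of Scala. This is an illustrative model (the `Withdraw`/`Withdrawn` names are hypothetical): a received command is validated and either rejected or turned into an immutable, past-tense event.

```scala
case class Withdraw(amount: Long)   // command (imperative): can be rejected
case class Withdrawn(amount: Long)  // event (past tense): cannot be undone

// Validation: only accepted commands produce an event to persist
def validate(balance: Long, cmd: Withdraw): Either[String, Withdrawn] =
  if (cmd.amount <= balance) Right(Withdrawn(cmd.amount))
  else Left(s"insufficient funds: ${cmd.amount} > $balance")

val ok  = validate(100, Withdraw(40))   // accepted → event persisted
val bad = validate(100, Withdraw(500))  // rejected → no event, no mutation
```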
19. Command and Event sourcing
Event Sourcing
An informal and short definition...
Append every command (or event) received (or generated) to a journal instead of storing the current state of the application!
20. CRUD vs Event sourcing
Event Sourcing
Account created → Deposited 100 EUR → Withdrawn 40 EUR → Deposited 200 EUR
- CRUD
- The account table keeps the current amount available (260)
- Occurred events are stored in a separate table
- Event Sourcing
- The current state is kept in memory or rebuilt by replaying all events
- 100 – 40 + 200 => 260
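The event-sourced side of the account example boils down to a fold over the journal. A minimal sketch in plain Scala (the event names are illustrative): current state is derived by replaying events, not stored as a row.

```scala
sealed trait AccountEvent
case object AccountCreated extends AccountEvent
case class Deposited(amount: Long) extends AccountEvent
case class Withdrawn(amount: Long) extends AccountEvent

// The append-only journal of everything that happened
val journal = List(AccountCreated, Deposited(100), Withdrawn(40), Deposited(200))

// Replay: fold the events to rebuild the current balance
val balance = journal.foldLeft(0L) {
  case (acc, Deposited(a)) => acc + a
  case (acc, Withdrawn(a)) => acc - a
  case (acc, _)            => acc
}
// 100 – 40 + 200 = 260
```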
21. Main properties
- There is no delete
- Performance and Scalability
- “Append only” models are easier to scale
- Horizontal Partitioning (Sharding)
- Rolling Snapshots
- No Impedance Mismatch
- The Event Log can bring great business value
Event Sourcing
24. Main properties
- Akka Persistence enables stateful actors to persist their internal state
- Recover state after
- Actor start
- Actor restart
- JVM crash
- By supervisor
- Cluster migration
Akka Persistence
25. Main properties
- Changes are appended to storage
- Nothing is mutated
- High transaction rates
- Efficient replication
- Stateful actors are recovered by replaying stored changes
- From the beginning or from a snapshot
- Also provides P2P communication with at-least-once message delivery semantics
Akka Persistence
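The recovery idea (replay from the beginning, or from a snapshot) can be modeled in plain Scala, without the Akka API. This is an illustrative sketch: the journal holds persisted changes as deltas, and a snapshot records the state reached after some sequence number.

```scala
case class Snapshot(atSeqNr: Int, state: Long)

val journal  = Vector(10L, -4L, 20L, 5L)          // persisted changes (deltas)
val snapshot = Snapshot(atSeqNr = 2, state = 6L)  // state after the first 2 events

// Full recovery: replay every change from the beginning
val fromScratch = journal.foldLeft(0L)(_ + _)

// Faster recovery: start from the snapshot, replay only the tail
val fromSnapshot = journal.drop(snapshot.atSeqNr).foldLeft(snapshot.state)(_ + _)
```

Both paths reach the same state; the snapshot just shortens the replay, which is exactly why snapshot stores optimize recovery times.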
26. Components
- PersistentActor → persistent, stateful actor
- Command- or event-sourced actor
- Persists commands/events to a journal
- PersistentView → receives journaled messages written by another persistent actor
- AtLeastOnceDelivery → delivery guaranteed even if the sender or receiver JVM crashes
- Journal → stores the sequence of messages sent to a persistent actor
- Snapshot store → used for optimizing recovery times
Akka Persistence
27. Code example
class BookActor extends PersistentActor {
override val persistenceId: String = "book-persistence"
override def receiveRecover: Receive = {
case _ => // RECOVER AFTER A CRASH HERE...
}
override def receiveCommand: Receive = {
case _ => // VALIDATE COMMANDS AND PERSIST EVENTS HERE...
}
}
type Receive = PartialFunction[Any, Unit]
Akka Persistence
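The `Receive` alias above is worth dwelling on: an actor's behavior is just a partial function from `Any` to `Unit`, defined only for the messages it understands. A minimal plain-Scala sketch (no Akka needed):

```scala
type Receive = PartialFunction[Any, Unit]

var log: List[String] = Nil

// A behavior that only handles Strings and Ints
val behavior: Receive = {
  case s: String => log = s"got string: $s" :: log
  case n: Int    => log = s"got int: $n" :: log
}

behavior("hello")
behavior(42)
val handlesDouble = behavior.isDefinedAt(3.14) // false: unhandled message type
```

In Akka, unhandled messages don't crash the actor; being partial is precisely what lets `receiveCommand` and `receiveRecover` list only the cases they care about.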
29. Apache Spark is a cluster computing platform designed to be fast and general-purpose
Spark SQL
Structured data
Spark Streaming
Real Time
MLlib
Machine Learning
GraphX
Graph Processing
Spark Core
Standalone Scheduler YARN Mesos
Apache Spark
The Stack
30. Apache Spark
The Stack
- Spark SQL: allows querying data via SQL as well as the Hive variant of SQL (HQL) and supports
many sources of data, including Hive tables, Parquet and JSON
- Spark Streaming: component that enables processing of live streams of data in an elegant, fault-tolerant,
scalable and fast way
- MLlib: library containing common machine learning (ML) functionality, including algorithms such as
classification, regression, clustering and collaborative filtering, that scale out across a cluster
- GraphX: library for manipulating graphs and performing graph-parallel computation
- Cluster Managers: Spark is designed to efficiently scale up from one to many thousands of compute
nodes. It can run over a variety of cluster managers, including Hadoop YARN and Apache Mesos. Spark
also ships with a simple cluster manager of its own, called the Standalone Scheduler
32. Apache Spark
Core Concepts
- Every Spark application consists of a driver program that launches various parallel operations
on the cluster. The driver program contains your application’s main function and defines
distributed datasets on the cluster, then applies operations to them
- Driver programs access Spark through the SparkContext object, which represents a connection
to a computing cluster
- The SparkContext can be used to build RDDs (Resilient Distributed Datasets) on which you can
run a series of operations
- To run these operations, driver programs typically manage a number of nodes called executors
33. Apache Spark
RDD (Resilient Distributed Dataset)
It is an immutable distributed collection of data, which is partitioned across
machines in a cluster.
It facilitates two types of operations: transformations and actions
- Resilient: it can be recreated when data in memory is lost
- Distributed: stored in memory across the cluster
- Dataset: data that comes from a file or is created programmatically
34. Apache Spark
Transformations
- A transformation is an operation such as map(), filter() or union() on an RDD that yields
another RDD
- Transformations are lazily evaluated: they don’t run until an action is executed
- The Spark driver remembers the transformations applied to an RDD, so if a partition is lost,
that partition can easily be reconstructed on some other machine in the cluster
(Resilient)
- Resiliency is achieved via a Lineage Graph
35. Apache Spark
Actions
- Compute a result based on an RDD and either return it to the driver program
or save it to an external storage system
- Typical RDD actions are count(), first() and take(n)
36. Apache Spark
Transformations vs Actions
RDD → RDD (transformation), RDD → Value (action)
Transformations: define new RDDs based on current one. E.g. map, filter, reduce etc.
Actions: return values. E.g. count, sum, collect, etc.
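The lazy-transformation / eager-action split can be modeled with plain Scala views, which behave the same way. This is an illustrative sketch, not Spark code: the counter shows that nothing runs until the "action" forces the pipeline.

```scala
var evaluated = 0

val data = (1 to 10).view                 // like an RDD: nothing computed yet
val transformed = data
  .map { x => evaluated += 1; x * 2 }     // "transformation": recorded, not run
  .filter(_ > 10)                         // "transformation": still not run

val before = evaluated                    // no elements touched so far
val result = transformed.sum              // "action": triggers the whole pipeline
```

Like an RDD's lineage graph, the view remembers the chain of operations and can re-run it on demand.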
37. Apache Spark
Benefits
Scalable: can be deployed on very large clusters
Fast: in-memory processing for speed
Resilient: recovers in case of data loss
Written in Scala, with a simple high-level API for Scala, Java and Python
38. Apache Spark
Lambda Architecture – one-size-fits-all technology!
New data
Batch Layer
Speed Layer
Serving Layer
Data
Consumers
Query
Spark
Spark
39. - Spark Streaming receives streaming input and divides the data into batches, which are then
processed by the Spark Core
Input data stream → Spark Streaming → batches of input data → Spark Core → batches of processed data
Apache Spark
Speed Layer
40. val numThreads = 1
val group = "test"
val topics = "test"
// Map each topic to the number of consumer threads for it
val topicMap = topics.split(",").map((_, numThreads)).toMap
val conf = new SparkConf().setMaster("local[*]").setAppName("KafkaWordCount")
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(2))
// Connect via ZooKeeper and keep only the message payload
// (the second element of each (key, message) pair)
val lines = KafkaUtils.createStream(ssc, "localhost:2181", group,
topicMap).map(_._2)
val words = lines.flatMap(_.split(","))
val wordCounts = words.map { x => (x, 1L) }.reduceByKey(_ + _)
....
ssc.start()
ssc.awaitTermination()
Apache Spark – Streaming word count example
Streaming with Spark and Kafka
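The word-count logic itself is ordinary functional code; the same pipeline can be sketched over a plain Scala collection (here `groupBy` stands in for Spark's `reduceByKey`):

```scala
val lines = List("spark,kafka,spark", "akka,spark")

val wordCounts = lines
  .flatMap(_.split(","))                               // split lines into words
  .map(w => (w, 1L))                                   // pair each word with a count
  .groupBy(_._1)                                       // group pairs by word
  .map { case (w, pairs) => (w, pairs.map(_._2).sum) } // sum the counts per word
```

The Spark version applies exactly these steps, only distributed across a cluster and repeated on every micro-batch.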
42. Real Time “Bidding”
High level architecture
Akka
Persistence
Input
Output
Cassandra
Kafka
Training Prediction Scoring
Spark Batch
Real Time
Action
Dispatch
Publish
Store
Journaling
43. Apache Kafka
Distributed messaging system
- Fast: high throughput for both publishing and subscribing
- Scalable: very easy to scale out
- Durable: supports persistence of messages
- Consumers are responsible for tracking their own position in each log
Producers 1 and 2 write to partitions 1, 2 and 3; consumers A, B and C each read independently.
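The last bullet, consumers tracking their own position, can be sketched with a toy model (hypothetical names, not the Kafka API): each partition is an append-only log on the broker, and each consumer holds its own offset into it.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy model (not the Kafka API): a partition is an append-only log;
// the broker never tracks consumer positions, each consumer keeps its own offset.
class PartitionLog {
  private val entries = ArrayBuffer.empty[String]
  def append(msg: String): Long = { entries += msg; entries.size - 1L } // offset
  def read(offset: Long): Option[String] = entries.lift(offset.toInt)
}
class Consumer(log: PartitionLog) {
  private var offset = 0L // position is tracked by the consumer itself
  def poll(): Option[String] = log.read(offset).map { m => offset += 1; m }
}
val log = new PartitionLog
log.append("bid-1"); log.append("bid-2")
val fast = new Consumer(log)
val slow = new Consumer(log)
fast.poll(); fast.poll()
assert(slow.poll() == Some("bid-1")) // independent consumers, independent offsets
```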
44. Apache Cassandra
Massively scalable NoSQL datastore
- Elastic Scalability
- No single point of failure
- Fast linear scale performance
1. Clients write to any Cassandra node
2. The coordinator node replicates to nodes and zones
3. Nodes return an ack to the client
4. Data is written to the internal commit log on disk
5. If a node goes offline, hinted handoff completes the write
when the node comes back up
- Regions = Datacenters
- Zones = Racks
[Diagram: a cluster of nodes, all peers]
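The write path above can be sketched as follows (a simplified toy, not Cassandra's implementation): the coordinator forwards the write to every replica and acknowledges the client once enough replicas respond for the requested consistency level; offline replicas are caught up later via hinted handoff.

```scala
// Simplified toy (not Cassandra's implementation) of the write path:
// the coordinator acks the client once enough replicas respond; writes
// to offline replicas are completed later via hinted handoff.
case class Replica(id: Int, online: Boolean = true)
def coordinatorWrite(replicas: Seq[Replica], required: Int): Boolean = {
  val acks = replicas.count(_.online) // replicas that applied the write now
  acks >= required                    // e.g. QUORUM = N/2 + 1
}
val nodes = Seq(Replica(1), Replica(2), Replica(3, online = false))
assert(coordinatorWrite(nodes, required = 2)) // QUORUM succeeds with one node down
```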
46. MILAN - 08TH OF MAY - 2015
PARTNERS
THANK YOU!
Stefano Rocco - @whispurr_it
Roberto Bentivoglio - @robbenti
@DATABIZit
FAQ
We’re hiring!
Editor's Notes
Responsive -> The system responds in a timely manner if at all possible
Elastic -> The system stays responsive under varying workload
Resilient -> The system stays responsive in the face of failure
Message Driven -> Reactive Systems rely on asynchronous message-passing to establish a boundary between components that ensures loose coupling, isolation, location transparency, and provides the means to delegate errors as messages
Remember to mention and explain CRUD
Simplicity
- One could teach a Junior developer how to interact with a system built using this architecture in a very short period of time
- The architecture is completely generic.
Tooling (Framework)
- For instance ORM
Scaling
- RDBMS are at this point not horizontally scalable and vertically scaling becomes prohibitively expensive very quickly
DDD
- CRUD => Anemic Model (object containing only data and not behavior)
Method
command => perform an action (MUTATES THE STATE: SIDE EFFECT)
query => return data to the caller (NO SIDE EFFECT: IT'S REFERENTIALLY TRANSPARENT)
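The command/query split can be sketched in Scala (a generic illustration, not code from the talk):

```scala
// Command/Query Separation in miniature: commands mutate state (side effect),
// queries only return data.
class Account {
  private var balance = 0
  def deposit(amount: Int): Unit = balance += amount // command: side effect
  def currentBalance: Int = balance                  // query: no side effect
}
val acc = new Account
acc.deposit(50)
assert(acc.currentBalance == 50)
```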
In this slide you don’t need to introduce Event Sourcing but only to speak about command/write/left side vs query/read/right side
Explaining with others words/figures the meaning of Command and Query
Main properties of CQRS
An event is something that has happened in the past.
Remember to speak about the append on journal
Remind the audience that with an append-only journal there is no deletion: corrections are recorded as compensating events with the opposite sign
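The no-deletion point can be sketched as an append-only journal (a toy illustration, with hypothetical event values): a correction appends a compensating event rather than removing history.

```scala
import scala.collection.mutable.ArrayBuffer

// Append-only journal sketch: nothing is ever deleted; a correction is a
// new compensating event with the opposite sign.
val journal = ArrayBuffer.empty[Int]
journal += 100  // e.g. Deposited(100)
journal += -100 // compensating event, instead of deleting the deposit
assert(journal.sum == 0)  // current state, rebuilt by replaying events
assert(journal.size == 2) // full history is preserved
```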