The document discusses event sourcing and CQRS architectures using technologies like Akka, Cassandra, and Spark. It provides an overview of how event sourcing avoids the limitations of traditional mutable databases by using an immutable write log. It describes how CQRS separates read and write concerns for better scalability. Example architectures show how Akka persistence can store events in Cassandra and provide views of data, while Spark can perform analytics on the full event stream.
codecentric AG: CQRS and Event Sourcing Applications with Cassandra (DataStax Academy)
CQRS (Command Query Responsibility Segregation) is a pattern that separates the process of querying data from the process of updating it: a query only returns data, without side effects, while a command is designed to change data. CQRS is often combined with Event Sourcing, an architecture in which all changes to application state are stored as a sequence of events.
Because of its strength at storing time series data, Cassandra is a perfect fit for implementing the event store. But there are still a lot of open questions: What about the data modeling? Which techniques will be used to process and store data in the Cassandra database? How do you access the current state of the application without replaying every event? And what about failure handling?
In this talk, I will give a brief introduction to CQRS and the Event Sourcing pattern and will then answer the questions above using a real-life example of a data store for customer data.
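To make the data-modeling question concrete, a common Cassandra event-store layout partitions by aggregate and clusters by a time-based column, so a single partition read replays one entity's history in order. The sketch below, using the DataStax Java driver from Scala, is one illustrative model; the keyspace, table, and column names are hypothetical, not taken from the talk.

```scala
// Illustrative Cassandra event-store schema, created via the DataStax
// Java driver. All names here are hypothetical.
import com.datastax.driver.core.Cluster

object EventStoreSchema extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS shop
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}""".stripMargin)

  // One partition per aggregate; events are clustered by a timeuuid so a
  // single partition read replays them in order, and writes stay append-only.
  session.execute(
    """CREATE TABLE IF NOT EXISTS shop.customer_events (
      |  customer_id uuid,
      |  event_id    timeuuid,
      |  event_type  text,
      |  payload     text,
      |  PRIMARY KEY (customer_id, event_id)
      |) WITH CLUSTERING ORDER BY (event_id ASC)""".stripMargin)

  cluster.close()
}
```

The usual answer to the current-state question is a separate, continuously updated snapshot or view table alongside the event log, so reads do not have to replay the full history.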
Bellevue Big Data meetup: Dive Deep into Spark Streaming (Santosh Sahoo)
A discussion of the code and architecture for building a real-time streaming application using Spark and Kafka. This demo presents some use cases and patterns of different streaming frameworks.
Strava Labs: Exploring a Billion Activity Dataset from Athletes with Apache S... (Databricks)
At Strava we have extensively leveraged Apache Spark to explore our data of over a billion activities from tens of millions of athletes. This talk will be a survey of the more unique and exciting applications: a Global Heatmap gives a ~2 meter resolution density map of one billion runs, rides, and other activities, consisting of three trillion GPS points from 17 billion miles of exercise data. The heatmap was rewritten from a non-scalable system into a highly scalable Spark job, enabling great gains in speed, cost, and quality. Locality-sensitive hashing for GPS traces was used to efficiently cluster one billion activities. Additional processes categorize and extract data from each cluster, such as names and statistics. Clustering gives an automated process to extract worldwide geographical patterns of athletes.
Applications include route discovery, recommendation systems, and detection of events and races. A coarse spatiotemporal index of all activity data is stored in Apache Cassandra. Spark streaming jobs maintain this index and compute all space-time intersections ("flybys") of activities in it. Intersecting activity pairs are then checked for spatiotemporal correlation; connected components in the graph of highly correlated pairs form "Group Activities", creating a social graph of shared activities and workout partners. Data from several hundred thousand runners was used to build an improved model of the relationship between running difficulty and elevation gradient (Grade Adjusted Pace).
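As a rough illustration of the bucketing idea behind such clustering (the talk's exact LSH scheme is not reproduced here), one can snap GPS points onto a coarse grid and group activities that share cells; the Spark sketch below is a simplified stand-in, with hypothetical input paths and parameters.

```scala
// Simplified stand-in for LSH-style bucketing of GPS traces in Spark:
// snap each point to a coarse grid cell, then group activities sharing
// cells as candidate clusters. An illustration of the idea only, not
// Strava's actual algorithm.
import org.apache.spark.SparkContext

object GpsBucketing {
  // ~0.001 degrees is on the order of 100 m; real schemes typically use
  // several hash functions and resolutions to control false positives.
  def cell(lat: Double, lon: Double, res: Double = 0.001): (Long, Long) =
    ((lat / res).toLong, (lon / res).toLong)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "gps-bucketing-sketch")

    // Hypothetical input: "activityId,lat,lon" lines.
    val points = sc.textFile("hdfs:///activities/points")
      .map(_.split(","))
      .map(f => (f(0), f(1).toDouble, f(2).toDouble))

    // Map each point to its grid cell, then collect the set of
    // activities passing through each cell as candidate clusters.
    val candidates = points
      .map { case (id, lat, lon) => (cell(lat, lon), id) }
      .distinct()
      .groupByKey()
      .filter(_._2.size > 1)

    candidates.take(10).foreach(println)
    sc.stop()
  }
}
```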
Reactive dashboards using Apache Spark (Rahul Kumar)
A tutorial talk on Apache Spark: how to start working with Spark, the features of Spark, and how to compose a data platform with it. The talk also covers the reactive platform and related tools and frameworks such as Play and Akka.
Distributed Stream Processing - Spark Summit East 2017 (Petr Zapletal)
The demand for stream processing is increasing rapidly. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
A number of powerful, easy-to-use open source platforms have emerged to address this. But the same problem can be solved in different ways, the platforms target varied but sometimes overlapping use cases, and they often use different vocabularies for similar concepts. This can lead to confusion, longer development time, or costly wrong decisions.
AWS Big Data Demystified #3 | Zeppelin + spark sql, jdbc + thrift, ganglia, r... (Omid Vahdaty)
AWS Big Data Demystified is all about knowledge sharing, because knowledge should be given for free. In this lecture we will discuss the advantages of working with Zeppelin + Spark SQL, JDBC + Thrift, Ganglia, R + SparkR + Livy, and a little bit about Ganglia on EMR.
Subscribe to our YouTube channel to see the video of this lecture:
https://www.youtube.com/channel/UCzeGqhZIWU-hIDczWa8GtgQ?view_as=subscriber
A comprehensive introduction to the big data world in the AWS cloud: Hadoop, streaming, batch, Kinesis, DynamoDB, HBase, EMR, Athena, Hive, Spark, Pig, Impala, Oozie, Data Pipeline, security, cost, and best practices.
Big Data Day LA 2015 - Sparking up your Cassandra Cluster- Analytics made Awe... (Data Con LA)
After a brief technical introduction to Apache Cassandra, we'll go into the exciting world of Apache Spark integration and learn how you can turn your transactional datastore into an analytics platform. Apache Spark has taken the Hadoop world by storm (no pun intended!) and is widely seen as the replacement for Hadoop MapReduce. Apache Spark and Cassandra are perfect allies: Cassandra does the distributed data storage, Spark does the distributed computation.
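A minimal sketch of that pairing, assuming the DataStax spark-cassandra-connector; the `store.orders` keyspace/table and its `amount` column are hypothetical.

```scala
// Minimal analytics pass over a Cassandra table with the
// spark-cassandra-connector; keyspace, table and column names are
// hypothetical.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object CassandraAnalytics extends App {
  val conf = new SparkConf()
    .setAppName("cassandra-analytics-sketch")
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)

  // Cassandra holds the data; Spark distributes the computation,
  // preferring data-local reads where workers are co-located with nodes.
  val revenue = sc.cassandraTable("store", "orders")
    .map(_.getDouble("amount"))
    .sum()

  println(s"Total revenue: $revenue")
  sc.stop()
}
```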
Amazon aws big data demystified | Introduction to streaming and messaging flu... (Omid Vahdaty)
Amazon AWS Big Data Demystified meetup:
https://www.meetup.com/AWS-Big-Data-Demystified/
An introduction to streaming and messaging: Flume, Kafka, SQS, and Kinesis.
Zeotap: Moving to ScyllaDB - A Graph of Billions Scale (ScyllaDB)
Zeotap’s Connect product addresses the challenges of identity resolution and linking for AdTech and MarTech. Zeotap manages roughly 20 billion IDs, and growing. In their presentation, Zeotap engineers will delve into data access patterns and processing and storage requirements to make the case for a graph-based store. They will share the results of PoCs on technologies such as Dgraph, OrientDB, Aerospike, and Scylla, present the reasoning for selecting JanusGraph backed by Scylla, and take a deep dive into their data model architecture from the point of ingestion. Learn what is required for the production setup, configuration, and performance tuning to manage data at this scale.
The state of analytics has changed dramatically over the last few years. Hadoop is now commonplace, and the ecosystem has evolved to include new tools such as Spark, Shark, and Drill, that live alongside the old MapReduce-based standards. It can be difficult to keep up with the pace of change, and newcomers are left with a dizzying variety of seemingly similar choices. This is compounded by the number of possible deployment permutations, which can cause all but the most determined to simply stick with the tried and true. In this talk I will introduce you to a powerhouse combination of Cassandra and Spark, which provides a high-speed platform for both real-time and batch analysis.
Analytical DBMS to Apache Spark Auto Migration Framework with Edward Zhang an... (Databricks)
eBay has been using an Analytical DBMS (ADBMS) data warehouse solution for over a decade. Millions of batch queries run every day against 6,000+ key DW tables, which contain over 22 PB of data (compressed), a volume that keeps growing every year. The data services and products built on top of it enable eBay business decisions and site features, so it has to be always available and accurate.
Apache Spark provides an open-source and more scalable solution for this amount of data. Since the beginning of this year, eBay has been migrating the ADBMS batch workload to Spark, with about 90% of it migrated automatically. Our team leads the automation tools and pipeline to complete the migration within this year.
In today’s session, we will introduce:
1. The tool set that enables the auto-migration engine, including metadata services, a SQL converter, a table/view generator, a data mover, an optimizer, a pipeline generator, a data validator, and a workflow controller; many of these not only contribute to auto-migration but also support the development work of individual engineers
2. The end-to-end auto-migration steps up to production cut-over: initializing the dev environment, unit testing, data validation, integration testing, release, parallel run, monitoring, and cut-over
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference (DB Tsai)
Lambda architecture is a data-processing architecture designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods. In the Lambda architecture, the system involves three layers, batch processing, speed (or real-time) processing, and a serving layer for responding to queries, and each comes with its own set of requirements.
The batch layer aims at perfect accuracy by processing the entire available dataset, an immutable, append-only set of raw data, using a distributed processing system. Output is typically stored in a read-only database, with results completely replacing the existing precomputed views. Apache Hadoop, Pig, and Hive are the de facto batch-processing systems.
In the speed layer, data is processed in a streaming fashion and real-time views are built from the most recent data. The speed layer is thus responsible for filling the "gap" caused by the batch layer's lag in providing views based on the most recent data. Its views may not be as accurate as those the batch layer computes from the full dataset, so they are eventually replaced by the batch layer's views. Traditionally, Apache Storm is used in this layer.
In the serving layer, the results from the batch layer and the speed layer are stored, and it responds to queries in a low-latency, ad hoc way.
One example of the lambda architecture in a machine learning context is a fraud detection system. In the speed layer, incoming streaming data can be used for online learning to update the model learned in the batch layer so that it incorporates recent events. After a while, the model can be rebuilt using the full dataset.
Why Spark for lambda architecture? Traditionally, different technologies are used in the batch layer and the speed layer. If your batch system is implemented with Apache Pig and your speed layer with Apache Storm, you have to write and maintain the same logic in SQL and in Java/Scala. This very quickly becomes a maintenance nightmare. With Spark, we have a unified development framework for the batch and speed layers at scale. In this talk, an end-to-end example implemented in Spark will be shown, and we will discuss the development, testing, maintenance, and deployment of a lambda-architecture system with Apache Spark.
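As a minimal sketch of that unification, the same scoring function can serve both layers: applied once to the full historical dataset and again to every streaming micro-batch. Everything here (paths, ports, the toy comma-separated record format and its "high-risk" flag) is hypothetical.

```scala
// One scoring function shared by the batch and speed layers; a sketch,
// not the talk's actual example. Input format and paths are hypothetical.
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SharedLogic {
  // Written once against RDDs, so both layers reuse it unchanged.
  def scoreEvents(events: RDD[String]): RDD[(String, Int)] =
    events.map(_.split(","))
      .map(fields => (fields(0), if (fields(1) == "high-risk") 1 else 0))
      .reduceByKey(_ + _)
}

object LambdaSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "lambda-sketch")

    // Batch layer: recompute views from the immutable master dataset.
    SharedLogic.scoreEvents(sc.textFile("hdfs:///events/full"))
      .saveAsTextFile("hdfs:///views/batch")

    // Speed layer: the same function applied to each micro-batch.
    val ssc = new StreamingContext(sc, Seconds(10))
    ssc.socketTextStream("localhost", 9999)
      .foreachRDD(rdd => SharedLogic.scoreEvents(rdd).foreach(println))

    ssc.start()
    ssc.awaitTermination()
  }
}
```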
We’ll present details about Argus, a time-series monitoring and alerting platform developed at Salesforce to provide insight into the health of infrastructure as an alternative to systems such as Graphite and Seyren.
Data processing platforms architectures with Spark, Mesos, Akka, Cassandra an... (Anton Kirillov)
This talk is about architecture designs for data processing platforms based on the SMACK stack, which stands for Spark, Mesos, Akka, Cassandra and Kafka. The main topics of the talk are:
- SMACK stack overview
- storage layer layout
- fixing NoSQL limitations (joins and group by; see the sketch after this list)
- cluster resource management and dynamic allocation
- reliable scheduling and execution at scale
- different options for getting the data into your system
- preparing for failures with proper backup and patching strategies
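On the joins-and-group-by point: Cassandra itself has no general cross-partition GROUP BY or join, so a common fix in the SMACK stack is to pull the table into Spark and aggregate there. A minimal sketch, assuming the DataStax spark-cassandra-connector and hypothetical `sensors.readings` / `sensors.readings_avg` tables:

```scala
// Group-by over a Cassandra table via Spark; the connector pulls the
// table in as an RDD and Spark does the aggregation Cassandra cannot.
// Keyspace, table and column names are hypothetical.
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

object GroupByOnCassandra extends App {
  val conf = new SparkConf()
    .setAppName("smack-groupby-sketch")
    .set("spark.cassandra.connection.host", "127.0.0.1")
  val sc = new SparkContext(conf)

  // Average reading per sensor: sum and count per key, then divide.
  val avgBySensor = sc.cassandraTable("sensors", "readings")
    .map(row => (row.getString("sensor_id"), (row.getDouble("value"), 1)))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
    .mapValues { case (sum, count) => sum / count }

  // Write the aggregate back to a Cassandra summary table for fast reads.
  avgBySensor.saveToCassandra("sensors", "readings_avg",
    SomeColumns("sensor_id", "avg_value"))

  sc.stop()
}
```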
NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis (Helena Edelson)
Slides from my talk with Evan Chan at Strata San Jose: NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis. Streaming analytics architecture in big data for fast streaming, ad hoc and batch, with Kafka, Spark Streaming, Akka, Mesos, Cassandra and FiloDB. Simplifying to a unified architecture.
Elassandra: Elasticsearch as a Cassandra Secondary Index (Rémi Trouville, Vin...) (DataStax)
Many companies use both Elasticsearch and Cassandra, typically for logs or time series, but managing several pieces of software at large scale can be quite challenging. Elassandra tightly integrates Elasticsearch within Cassandra as a secondary index, allowing near-real-time search with all existing Elasticsearch APIs, plugins, and tools like Kibana. We will present the core concepts of Elassandra and explain how it draws on internal Cassandra features to make Elasticsearch masterless, scalable with automatic resharding, and more reliable and efficient than deploying both systems separately. We will also explore the bidirectional mapping: the way Elasticsearch automatically creates the corresponding Cassandra schema, and the way Elasticsearch indexes an existing Cassandra table. Furthermore, we will share some use cases and benchmark results demonstrating practical use of Elassandra to scale out, re-index with zero downtime, and search and visualize data with various tools.
About the Speakers
Remi Trouville, Independent Consultant
Remi is an IT engineer who has worked for the last 8 years in the financial industry as a team manager responsible for all the call-center software managing the customer experience. At the end of this period, his team was dealing with 10,000+ agents across 100+ sites and some highly critical business processes, such as storing oral proof of sale for transactions. He holds a Master's degree in telecommunication engineering and is now pursuing an executive MBA at a French business school.
Cassandra Day SV 2014: Scaling Hulu’s Video Progress Tracking Service with Ap... (DataStax Academy)
At Hulu, we deal with scaling our web services to meet the demands of an ever-growing number of users. During this talk, we will discuss our initial use case for Cassandra at Hulu: the video progress tracking service known as hugetop. While Cassandra provides a fantastic platform on which to build scalable applications, there are some dark corners of which to be cautious. We will provide a walkthrough of hugetop and some design decisions that went into the hugetop keyspace, our hardware choices, and our experiences operating Cassandra in a high-traffic environment.
Feeding Cassandra with Spark-Streaming and Kafka (DataStax Academy)
In this session we will examine a sample application that simulates an IoT stream handled through Kafka, Spark Streaming, and into Cassandra. The session will discuss the implementation details, including the Kafka design considerations, Spark Streaming functionality, including working with windowing to achieve analytics, and finally Cassandra time series data model considerations. The example is based on OSS Kafka and the integrated Spark and Cassandra in DSE.
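A minimal sketch of such a pipeline, assuming the spark-streaming-kafka (0.8 direct stream) and spark-cassandra-connector libraries; the topic, keyspace, table, and record format are hypothetical.

```scala
// Kafka -> Spark Streaming (windowed counts) -> Cassandra, as a sketch.
// Broker address, topic and table names are hypothetical.
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object IotWindowing {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("iot-windowing-sketch")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val ssc = new StreamingContext(conf, Seconds(5))

    val readings = KafkaUtils
      .createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, Map("metadata.broker.list" -> "localhost:9092"), Set("iot-events"))

    // Events per device over a sliding 60s window, recomputed every 10s,
    // appended to a Cassandra time-series table.
    readings
      .map { case (_, value) => (value.split(",")(0), 1L) } // "deviceId,..."
      .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(10))
      .map { case (deviceId, cnt) => (deviceId, System.currentTimeMillis(), cnt) }
      .saveToCassandra("iot", "device_counts", SomeColumns("device_id", "ts", "cnt"))

    ssc.start()
    ssc.awaitTermination()
  }
}
```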
Using the SDACK Architecture to Build a Big Data Product (Evans Ye)
You have definitely heard about the SMACK architecture, which stands for Spark, Mesos, Akka, Cassandra, and Kafka. It’s especially suitable for building a lambda architecture system. But what is SDACK? It’s very much like SMACK, except the "D" stands for Docker. While SMACK is an enterprise-scale, multi-tenant-capable solution, the SDACK architecture is particularly suitable for building a data product. In this talk, I’ll cover the advantages of the SDACK architecture and how TrendMicro uses it to build an anomaly detection data product. The talk will cover:
1) The architecture we designed based on SDACK to support both batch and streaming workloads.
2) The data pipeline built on Akka Streams, which is flexible, scalable, and able to self-heal (see the sketch after this list).
3) The Cassandra data model designed to support time series data writes and reads.
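On the self-healing point, Akka Streams can wrap a failure-prone stage so it restarts with exponential backoff instead of tearing the whole pipeline down. A minimal sketch, with a hypothetical `fetchEvents` standing in for a flaky upstream source:

```scala
// Self-healing stream stage via RestartSource (Akka Streams).
// The upstream source here is a hypothetical stand-in.
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{RestartSource, Sink, Source}
import scala.concurrent.duration._

object SelfHealingPipeline extends App {
  implicit val system: ActorSystem = ActorSystem("sdack-sketch")
  implicit val mat: ActorMaterializer = ActorMaterializer()

  // Hypothetical upstream that may fail at any time in production.
  def fetchEvents: Source[String, _] =
    Source(List("event-1", "event-2", "event-3"))

  // RestartSource re-materializes the inner source with exponential
  // backoff whenever it fails or completes, so downstream keeps running.
  val resilient = RestartSource.withBackoff(
    minBackoff = 1.second,
    maxBackoff = 30.seconds,
    randomFactor = 0.2
  )(() => fetchEvents)

  resilient.runWith(Sink.foreach(println))
}
```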
ScyllaDB: What could you do with Cassandra compatibility at 1.8 million reque... (Data Con LA)
Scylla is a new, open-source NoSQL data store with a novel design optimized for modern hardware, capable of 1.8 million requests per second per node, while providing Apache Cassandra compatibility and scaling properties. While conventional NoSQL databases suffer from latency hiccups, expensive locking, and low throughput due to low processor utilization, the Scylla design is based on a modern shared-nothing approach. Scylla runs multiple engines, one per core, each with its own memory, CPU and multi-queue NIC. The result is a NoSQL database that delivers an order of magnitude more performance, with less performance tuning needed from the administrator.
With extra performance to work with, NoSQL projects can have more flexibility to focus on other concerns, such as functionality and time to market. Come for the tech details on what Scylla does under the hood, and leave with some ideas on how to do more with NoSQL, faster.
Speaker bio
Don Marti is technical marketing manager for ScyllaDB. He has written for Linux Weekly News, Linux Journal, and other publications. He co-founded the Linux consulting firm Electric Lichen. Don is a strategic advisor for Mozilla, and has previously served as president and vice president of the Silicon Valley Linux Users Group and on the program committees for Uselinux, Codecon, and LinuxWorld Conference and Expo.
Hello Cronies,
Here are the slides from our recent meetup.
Title: It's about Time: Deep dive into event store using Apache Cassandra
Big data At-A-Glance
· What is Big data?
· What we have seen so far in AJM Bigdata series?
· Refresher/Overview of basic terminology
· Where is it? Am I using it?
Introduction to Apache Cassandra
· What, When and Why of Apache Cassandra
· Protocol, Queries, Architecture and everything else
· Who is using Apache Cassandra
· Interesting use cases of Apache Cassandra (Twitter, Disqus, etc.)
· Demo application walk-through
Akka persistence == event sourcing in 30 minutes (Konrad Malawski)
Akka 2.3 introduces akka-persistence, a wonderful way of implementing event-sourced applications. Let's give it a shot and see how DDD and Akka are a match made in heaven :-)
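A minimal akka-persistence sketch of the idea: commands come in, events are persisted, and state is only ever updated from events (so it can be rebuilt by replay). The Counter domain is illustrative, and a journal plugin (for example the Cassandra one) must be configured for persistence to go anywhere.

```scala
// Minimal event-sourced actor with akka-persistence; the Counter domain
// is an illustration, not taken from the talk. Requires a configured
// journal plugin (e.g. akka-persistence-cassandra) in application.conf.
import akka.actor.{ActorSystem, Props}
import akka.persistence.PersistentActor

case object Increment                // command: a request to change state
case class Incremented(by: Int)      // event: a fact that already happened

class Counter extends PersistentActor {
  override def persistenceId: String = "counter-1"

  private var count = 0

  // Recovery: rebuild state by replaying the persisted events.
  override def receiveRecover: Receive = {
    case Incremented(by) => count += by
  }

  // Commands: validate, persist the resulting event, then apply it.
  override def receiveCommand: Receive = {
    case Increment =>
      persist(Incremented(1)) { evt =>
        count += evt.by
        println(s"count is now $count")
      }
  }
}

object Main extends App {
  val system = ActorSystem("es-demo")
  val counter = system.actorOf(Props[Counter], "counter")
  counter ! Increment
}
```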
Developing functional domain models with event sourcing (sbtb, sbtb2015) (Chris Richardson)
Event sourcing persists each entity as a sequence of state-changing events. An entity’s current state is derived by replaying the events. Event sourcing is a great way to implement event-driven microservices. When one service updates an entity, the new events are consumed by other services, which then update their own state. In this talk we describe how to implement business logic using a domain model that is based on event sourcing. You will learn how to write functional, immutable domain models in Scala. We will compare and contrast a hybrid OO/FP design with a purely functional approach. You will learn how Domain-Driven Design concepts such as bounded contexts and aggregates fit in with event-driven microservices.
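One common shape for such a functional, immutable model: commands return events, state transitions are pure functions, and current state is a left fold over the event history. The Account domain below is a hypothetical illustration, not the talk's example.

```scala
// Purely functional event-sourced aggregate: commands produce events,
// state is derived by folding events; nothing is mutated in place.
sealed trait AccountEvent
case class Deposited(amount: BigDecimal) extends AccountEvent
case class Withdrawn(amount: BigDecimal) extends AccountEvent

case class Account(balance: BigDecimal = BigDecimal(0)) {
  // Pure state transition: current state + event => next state.
  def applyEvent(event: AccountEvent): Account = event match {
    case Deposited(a) => copy(balance = balance + a)
    case Withdrawn(a) => copy(balance = balance - a)
  }

  // Command handling returns events (or a validation error), never state.
  def withdraw(amount: BigDecimal): Either[String, List[AccountEvent]] =
    if (amount <= balance) Right(List(Withdrawn(amount)))
    else Left("insufficient funds")
}

object Replay extends App {
  val history = List(Deposited(BigDecimal(100)), Withdrawn(BigDecimal(30)))
  // The entity's current state is just a fold over its event history.
  val current = history.foldLeft(Account())(_ applyEvent _)
  println(current.balance) // 70
}
```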
Developing event-driven microservices with event sourcing and CQRS (svcc, sv... (Chris Richardson)
Modern, cloud-native applications typically use a microservices architecture in conjunction with NoSQL and/or sharded relational databases. However, in order to successfully use this approach you need to solve some distributed data management problems including how to maintain consistency between multiple databases without using 2PC.
In this talk you will learn more about these issues and how to solve them by using an event-driven architecture. We will describe how Event Sourcing and Command Query Responsibility Segregation (CQRS) are a great way to realize an event-driven architecture. You will learn about a simple yet powerful approach for building modern, scalable applications.
Avoiding the Pit of Despair - Event Sourcing with Akka and Cassandra (Luke Tillman)
With Akka you take a complicated system and break it down into lots of smaller units (actors) that communicate by passing messages. A single actor system can easily scale to millions or tens of millions of actors running on many machines. As actors process messages, they build up internal state, and many times we want that state persisted somewhere. In this talk, we'll dive into the event sourcing API used to persist actor state in Akka and talk about how we build a data model to support it in Cassandra. At first, the data model seems pretty straightforward, but the more we dig in, the more we see that a couple of classic Cassandra anti-patterns are pushing us close to the Pit of Despair. We'll come up with a way to avoid these problems so we can go on building distributed systems happily ever after with Akka and Cassandra.
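The classic anti-pattern hinted at here is the unbounded partition: an entity that appends events forever makes one Cassandra partition grow without limit. A common mitigation, and roughly the idea behind the `partition_nr` column in the Akka Cassandra journal, is to bucket the sequence-number range so each partition has a bounded size. The sketch below is illustrative; the names and bucket size are hypothetical.

```scala
// Illustrative journal layout that caps partition size by bucketing
// sequence numbers into fixed-size partitions. Names are hypothetical.
import com.datastax.driver.core.Cluster

object BucketedJournal extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS sketch
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

  // The partition key combines the entity id with a bucket number, so
  // no single partition ever holds an entity's whole unbounded history.
  session.execute(
    """CREATE TABLE IF NOT EXISTS sketch.journal (
      |  persistence_id text,
      |  partition_nr   bigint,
      |  sequence_nr    bigint,
      |  event          blob,
      |  PRIMARY KEY ((persistence_id, partition_nr), sequence_nr)
      |)""".stripMargin)

  // With e.g. 500k events per bucket, writers derive the bucket from
  // the sequence number and readers replay bucket by bucket.
  val bucketSize = 500000L
  def partitionNr(sequenceNr: Long): Long = sequenceNr / bucketSize

  cluster.close()
}
```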
Webinar: Overcoming the Storage Challenges Cassandra and Couchbase Create (Storage Switzerland)
NoSQL databases like Cassandra and Couchbase are quickly becoming key components of the modern IT infrastructure. But this modernization creates new challenges, especially for storage in the broad sense. In-memory databases perform well when there is enough memory available. However, when data sets get too large and they need to access storage, application performance degrades dramatically. Moreover, even if enough memory is available, persistent client requests can bring the servers to their knees.
Join Storage Switzerland and Plexistor where you will learn:
1. What are Cassandra and Couchbase?
2. Why are organizations adopting them?
3. What storage challenges do they create?
4. How organizations attempt to work around these challenges.
5. How to design a solution to these challenges instead of a workaround.
Developing a fast and scalable application for your fancy new startup is hard. Many factors are responsible for the slowness of a website, like network latency, web server configuration, or large assets, but as any developer involved with high volumes knows, the real bottleneck is the database. In recent years a bunch of NoSQL solutions came to the rescue, each one with its pros and cons. Apache Cassandra is one of the most used and mature "Big Data" NoSQL stores, and is currently deployed on several projects by tech giants like Twitter, eBay, and Netflix, thanks to its extremely high throughput, automatic replication, and decentralization. During the session I'll talk about how to leverage Apache Cassandra's best features and data modeling best practices so your web application projects can respond to huge peaks of traffic, using open source tools such as Zend Framework and phpcassa, and describing a large e-commerce project currently using Cassandra.
Using Time Window Compaction Strategy For Time Series Workloads (Jeff Jirsa)
Cassandra is a great fit for high-write use cases, which makes it a popular choice for storing time series and sensor-collection workloads. At Crowdstrike, we've been using Cassandra for just that purpose, collecting petabytes of expiring time series data. In this talk, I'll discuss compaction in time series workloads and the TimeWindowCompactionStrategy we developed specifically for this purpose. I'll detail TWCS-specific configuration properties, some lesser-known compaction sub-properties that apply to all compaction strategies, and other general tricks and tuning that are useful for very large time-series workloads.
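For illustration, enabling TWCS comes down to the table's compaction options. The sketch below (issued from Scala via the DataStax Java driver, with hypothetical keyspace, table, window, and TTL values) pairs daily compaction windows with a 90-day default TTL, so whole SSTables expire together and can be dropped cheaply.

```scala
// Hypothetical time-series table using TimeWindowCompactionStrategy:
// one-day compaction windows plus a 90-day default TTL.
import com.datastax.driver.core.Cluster

object TwcsTable extends App {
  val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
  val session = cluster.connect()

  session.execute(
    """CREATE KEYSPACE IF NOT EXISTS metrics
      |WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}""".stripMargin)

  // Partitioning by (sensor_id, day) also bounds partition size, which
  // plays well with TWCS's time-bucketed SSTables.
  session.execute(
    """CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
      |  sensor_id text,
      |  day       date,
      |  ts        timestamp,
      |  value     double,
      |  PRIMARY KEY ((sensor_id, day), ts)
      |) WITH compaction = {
      |    'class': 'TimeWindowCompactionStrategy',
      |    'compaction_window_unit': 'DAYS',
      |    'compaction_window_size': 1
      |  }
      |  AND default_time_to_live = 7776000""".stripMargin) // 90 days in seconds

  cluster.close()
}
```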
Data in Motion: Streaming Static Data Efficiently 2 (Martin Zapletal)
Updated version for SD Berlin 2016. Distributed streaming performance, consistency, reliable delivery, durability, optimisations, event time processing and other concepts discussed and explained on Akka Persistence and other examples.
AppSync.org: open-source patterns and code for data synchronization in mobile... (Niko Nelissen)
AppSync.org is an open-source project for mobile app developers that provides patterns, algorithms, and source code for implementing data synchronization between mobile apps (clients) and backends (a server or mBaaS platform).
Spreadshirt Platform - An Architectural Overview (Jens Hadlich)
This presentation gives an overview of Spreadshirt's platform from a high-level architectural point of view. It highlights some of the problems we had in the past and explains how we solved them. It also touches some challenges we are currently facing while growing to become a truly global e-commerce platform for customized apparel.
This presentation describes two architectural concepts that are not new: the separation of commands and queries, and events as the source of information.
Together they form an unbeatable duo for developing high-performance, robust applications.
Cassandra Summit 2015: Real World DTCS For Operators (Jeff Jirsa)
Real World DTCS For Operators
The introduction of DateTieredCompactionStrategy in late 2014 was a significant step forward in providing a viable compaction strategy for time series data, especially time series data that will be TTL'd out. DateTieredCompactionStrategy's introduction was met with genuine excitement, and its rapid adoption is testament to developers' and operators' desire to have data compacted in a way that better matches their write patterns.
However, DateTieredCompactionStrategy's features come with significant limitations. This talk will review our real-world benchmarking and use cases for DTCS as a vehicle to discuss the implications of DateTieredCompactionStrategy for operational tasks such as repair, read-repair, bootstrapping, and especially DR recovery scenarios, and it will also discuss how those limitations led us to propose an operations-friendly alternative to DateTieredCompactionStrategy.
Patterns and practices for real-world event-driven microservices (Rachel Reese)
Jet.com is an e-commerce startup competing with Amazon. We're heavy users of F#, and have based our architecture around Azure-based event-driven functional microservices. Over the last several months, we've schooled ourselves on what works and what doesn't for F# and microservices. This session will walk you through the lessons we have learned on our way to developing our platform.
Cassandra & puppet, scaling data at $15 per month (daveconnors)
Constant Contact shares lessons learned from a DevOps approach to implementing Cassandra to manage social media data for over 400k small business customers. Puppet is the critical piece in our tool chain. The single most important factor was the willingness of Development and Operations to stretch beyond traditional roles and responsibilities.
Reflected Intelligence: Lucene/Solr as a self-learning data system (Trey Grainger)
What if your search engine could automatically tune its own domain-specific relevancy model? What if it could learn the important phrases and topics within your domain, automatically identify alternate spellings (synonyms, acronyms, and related phrases) and disambiguate multiple meanings of those phrases, learn the conceptual relationships embedded within your documents, and even use machine-learned ranking to discover the relative importance of different features and then automatically optimize its own ranking algorithms for your domain?
In this presentation, you’ll learn how to do just that: how to evolve Lucene/Solr implementations into self-learning data systems which are able to accept user queries, deliver relevance-ranked results, and automatically learn from your users’ subsequent interactions to continually deliver a more relevant experience for each keyword, category, and group of users.
Such a self-learning system leverages reflected intelligence to consistently improve its understanding of the content (documents and queries), the context of specific users, and the relevance signals present in the collective feedback from every prior user interaction with the system. Come learn how to move beyond manual relevancy tuning and toward a closed-loop system leveraging both the embedded meaning within your content and the wisdom of the crowds to automatically generate search relevancy algorithms optimized for your domain.
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex (Apache Apex)
Stream data processing is becoming increasingly important to support business needs for faster time to insight and action with growing volume of information from more sources. Apache Apex (http://apex.apache.org/) is a unified big data in motion processing platform for the Apache Hadoop ecosystem. Apex supports demanding use cases with:
* Architecture for high throughput, low latency and exactly-once processing semantics.
* Comprehensive library of building blocks including connectors for Kafka, Files, Cassandra, HBase and many more
* Java based with unobtrusive API to build real-time and batch applications and implement custom business logic.
* Advanced engine features for auto-scaling, dynamic changes, compute locality.
Apex has been developed since 2012 and is used in production in various industries like online advertising, the Internet of Things (IoT), and financial services.
Scaling up Uber's real-time data analytics (Xiang Fu)
Realtime infrastructure powers critical pieces of Uber. This talk will discuss the architecture, technical challenges, and learnings, and how a blend of open-source infrastructure (Apache Kafka/Flink/Pinot) and in-house technologies has helped Uber scale and has enabled SQL to power real-time decision making for city ops, data scientists, data analysts, and engineers.
Serverless Event Streaming with Pulsar Functions (StreamNative)
The last few years have seen the emergence of serverless as a paradigm for event streaming. Its very simple programming model has attracted developers in droves, and its ability to scale elastically has simplified operations significantly. Combined with its ubiquity across all cloud providers, serverless has become the leading choice for event processing at scale at a lot of companies.
In this talk, Sijie Guo from StreamNative will explore how the serverless paradigm is applied to event streaming in Apache Pulsar, a next-generation event streaming system. Pulsar provides native support for serverless functions, where events are processed as soon as they arrive in a streaming manner, with flexible deployment options (thread, process, container). He will describe how these serverless functions make data engineering easier and share real-world usage of Pulsar Functions.
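The programming model is indeed small; a Pulsar Function is essentially one method. A minimal sketch using the Java SDK interface from Scala (the topic wiring and deployment flags below are illustrative):

```scala
// A minimal Pulsar Function: consume a message, transform it, and let
// the runtime publish the return value to the configured output topic.
import org.apache.pulsar.functions.api.{Context, Function}

class ExclamationFunction extends Function[String, String] {
  override def process(input: String, context: Context): String = {
    // Invoked per event, as soon as the event arrives.
    context.getLogger.info(s"processing: $input")
    input + "!"
  }
}
```

It could then be deployed, in thread, process, or container mode, with something like `pulsar-admin functions create --jar my-functions.jar --classname ExclamationFunction --inputs in-topic --output out-topic` (exact flags depend on the Pulsar version).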
Will it Scale? The Secrets behind Scaling Stream Processing Applications (Navina Ramesh)
This talk was presented at the Apache Big Data 2016, North America conference that was held in Vancouver, CA (http://events.linuxfoundation.org/events/archive/2016/apache-big-data-north-america/program/schedule)
Scala-like distributed collections: dumping time-series data with Apache Spark (Demi Ben-Ari)
Spark RDDs are almost identical to Scala collections, just distributed: all of the transformations and actions are derived from the Scala collections API.
As Martin Odersky mentioned, “Spark - The Ultimate Scala Collections” is the right way to look at RDDs. But with that great distributed power comes a great many data problems: at first you’ll start tackling the concept of partitioning, then the actual data becomes the next thing to worry about.
In the talk we’ll go through an overview on Spark's architecture, and see how similar RDDs are to the Scala collections API. We'll then shift to the world of problems that you’ll be facing when using Spark for processing a vast volume of time-series data with multiple data stores (S3, MongoDB, Apache Cassandra, MySQL).
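To make the similarity concrete, here is a small side-by-side sketch (the data is a toy example):

```scala
// The same pipeline on a local Scala collection and on a Spark RDD;
// only the container changes, the combinator names line up one-to-one.
import org.apache.spark.SparkContext

object CollectionsVsRdd extends App {
  val numbers = (1 to 1000).toList

  // Plain Scala collections: evaluated eagerly on one JVM.
  val localSum = numbers.filter(_ % 2 == 0).map(_ * 2).sum

  // Spark RDD: identical transformations, lazily built and partitioned
  // across a cluster; the action (sum) triggers the actual computation.
  val sc = new SparkContext("local[*]", "collections-vs-rdd")
  val distributedSum = sc.parallelize(numbers)
    .filter(_ % 2 == 0)
    .map(_ * 2)
    .sum()

  println(localSum == distributedSum.toInt) // true
  sc.stop()
}
```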
When you start tackling many scale and performance problems, many questions arise:
> How to handle missing data?
> Should the system handle both serving and backend processes, or should we separate them out?
> Which solution is cheaper?
> How do we get the best performance for money spent?
In the talk we will tell the tale of all of the transformations we’ve made to our data and review the multiple data persistency layers... and I’ll try my best NOT to answer the question “which persistency layer is the best?” but I do promise to share our pains and lessons learned!
Incremental Processing on Large Analytical Datasets with Prasanna Rajaperumal... (Databricks)
Prasanna Rajaperumal and Vinoth Chandar will explore a specific problem of ingesting petabytes of data in Uber and why they ended up building an analytical datastore from scratch using Spark. Prasanna will discuss design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS.
Hoodie: How (And Why) We built an analytical datastore on Spark (Vinoth Chandar)
Exploring a specific problem of ingesting petabytes of data in Uber and why they ended up building an analytical datastore from scratch using Spark. Then, discuss design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS.
https://spark-summit.org/2017/events/incremental-processing-on-large-analytical-datasets/
Similar to "Cassandra as an event sourced journal for big data analytics" (Cassandra Summit 2015)
How Disney+ uses fast data ubiquity to improve the customer experience (Martin Zapletal)
Disney+ uses Amazon Kinesis to drive real-time actions like providing title recommendations for customers, sending events across microservices, and delivering logs for operational analytics to improve the customer experience. In this session, you learn how Disney+ built real-time data-driven capabilities on a unified streaming platform. This platform ingests billions of events per hour in Amazon Kinesis Data Streams, processes and analyzes that data in Amazon Kinesis Data Analytics for Apache Flink, and uses Amazon Kinesis Data Firehose to deliver data to destinations without servers or code. Hear how these services helped Disney+ scale its viewing experience to tens of millions of customers with the required quality and reliability.
Learn more about re:Invent 2020 at http://bit.ly/3c4NSdY
Customer experience at Disney+ through a data perspective (Martin Zapletal)
Disney+ has rapidly scaled to provide a personalized and seamless experience to tens of millions of customers. This experience is powered by a robust data platform that ingests, processes, and surfaces billions of events per hour using Delta Lake, Databricks, and AWS technologies. The data produced by the platform is used by a multitude of services, including a recommendation engine for a personalized experience, watch-experience optimization including group watch, and fraud and abuse prevention. In this session, you will learn how Disney+ built these capabilities: the architecture, technologies, design principles, and technical details that make it possible.
Using observability data (logs, metrics, and traces) as a source for supervised and reinforcement machine learning techniques, with the goal of optimizing large-scale systems.
Intelligent Distributed Systems Optimizations (Martin Zapletal)
This talk discusses techniques for optimizing the performance, availability, cost, or other attributes of a distributed system. First, the presentation introduces and explains in depth the optimization techniques used in state-of-the-art large-scale stream and fast data processing frameworks such as Akka Streams, Spark, or Flink, including logical and physical optimizations and code generation. Then, powerful optimization concepts applicable to general distributed systems, including systems built using Akka, are explained with examples. Finally, the presentation highlights the role of machine learning and artificial intelligence in the area and explains how machine-generated data such as logs and metrics can be used to model, minimize, maximize, or balance selected attributes of the system, demonstrated with examples from practice. Attendees will gain an understanding of the available optimization approaches and their tradeoffs, and of the value of machine learning and intelligence, and ultimately will be able to apply some of the techniques to optimize general distributed systems as well as streaming data processing systems built using Spark, Flink, or Akka Streams.
Data in Motion: Streaming Static Data Efficiently (Martin Zapletal)
Distributed streaming performance, consistency, reliable delivery, durability, optimisations, event time processing and other concepts discussed and explained on Akka Persistence and other examples.
In software engineering, the right architecture is essential for robust, scalable platforms. Wix has undergone a pivotal shift from event sourcing to a CRUD-based model for its microservices. This talk charts the course of that journey.
Event sourcing, which records state changes as immutable events, provided robust auditing and "time travel" debugging for Wix Stores' microservices. Despite its benefits, the complexity it introduced in state management slowed development. Wix responded by adopting a simpler, unified CRUD model. This talk will explore the challenges of event sourcing and the advantages of Wix's new "CRUD on steroids" approach, which streamlines API integration and domain event management while preserving data integrity and system resilience.
Participants will gain valuable insights into Wix's strategies for ensuring atomicity in database updates and event production, as well as caching, materialization, and performance optimization techniques within a distributed system.
Join us to discover how Wix has mastered the art of balancing simplicity and extensibility, and learn how the re-adoption of the modest CRUD has turbocharged their development velocity, resilience, and scalability in a high-growth environment.
2. ● Introduction
● Event sourcing and CQRS
● An emerging technology stack to handle data
● A reference application and its architecture
● A few use cases of the reference application
● Conclusion
3. ● Increasing importance of data analytics
● Current state
○ Destructive updates
○ Analytics tools with poor scalability and integration
○ Manual processes
○ Slow iterations
○ Not suitable for large amounts of data
4. ● Whole lifecycle of data
● Data processing
● Data stores
● Integration and messaging
● Distributed computing primitives
● Cluster managers and task schedulers
● Deployment, configuration management and DevOps
● Data analytics and machine learning
● Spark, Mesos, Akka, Cassandra, Kafka (SMACK, Infinity)
9. ● Append only data store
● No updates or deletes (no rewriting of history)
● Immutable data model
● Decouples data model of the application and storage
● Current state is not persisted but derived from the sequence of updates that led to it (see the sketch after this list)
● History, state known at any point in time
● Replayable
● Source of truth
● Optimisations possible
● Works well in a distributed environment - easy partitioning, conflict handling
● Helps avoid transactions
● Works well with DDD
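Since the deck's own types are not shown here, the following minimal sketch uses hypothetical UserEvent and UserState types to illustrate the core idea: the journal is append only, and current state is a left fold over it, so replaying the log reproduces the state known at any point in time.

// Minimal sketch of deriving state from an append-only log.
// UserEvent and UserState are illustrative types, not from the talk.
sealed trait UserEvent
case class UserRegistered(name: String) extends UserEvent
case class ExerciseRecorded(exercise: String, reps: Int) extends UserEvent

case class UserState(name: String = "", totalReps: Int = 0) {
  // Pure function applying a single event to the current state.
  def updated(event: UserEvent): UserState = event match {
    case UserRegistered(n)      => copy(name = n)
    case ExerciseRecorded(_, r) => copy(totalReps = totalReps + r)
  }
}

// The journal is never mutated; state is derived by replay.
val journal = Seq(UserRegistered("jan"), ExerciseRecorded("bench press", 10))
val current = journal.foldLeft(UserState())(_ updated _) // UserState(jan,10)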
11. ● Command Query Responsibility Segregation
● Read and write logically and physically separated
● Easier reasoning about the application
● Clear separation of concerns (business logic)
● Often different technology, scalability
● Often lower consistency - eventual, causal
12. Command
● Write side
● Messages, requests to mutate state
● Behaviour; essentially a serialized method call
● Don’t expose state
● Validated and may be rejected, or emit one or more events (e.g. submitting a form); see the sketch after this slide
Event
● Write side
● Immutable
● Indicating something that has happened
● Atomic record of state change
● Audit log
Query
● Read side
● Precomputed
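To make the command/event split concrete, here is a minimal sketch (all names are illustrative, not from the talk's code base): a command is validated on the write side and either rejected or turned into one or more immutable events, without ever exposing state.

sealed trait Command
case class SubmitForm(field: String) extends Command

sealed trait Event
case class FormSubmitted(field: String) extends Event

// Validation happens on the write side; a command is either
// rejected or results in one or more immutable events.
def handle(cmd: Command): Either[String, List[Event]] = cmd match {
  case SubmitForm(f) if f.nonEmpty => Right(List(FormSubmitted(f)))
  case SubmitForm(_)               => Left("field must not be empty")
}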
21. ● Actor backed by data store
● Immutable event sourced journal
● Supports CQRS (write and read side)
● Persistence, replay on failure, rebalance, at-least-once delivery (a minimal actor sketch follows)
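A minimal event sourced actor against the Akka 2.4 persistence API might look as follows; the command and event types are hypothetical stand-ins for the talk's domain.

import akka.persistence.PersistentActor

case class AddDeviation(description: String)    // command
case class DeviationAdded(description: String)  // event

class UserActor(name: String) extends PersistentActor {
  override def persistenceId: String = s"user-$name"

  private var deviations: List[String] = Nil

  // Write side: validate the command, persist the event, then apply it.
  override def receiveCommand: Receive = {
    case AddDeviation(d) =>
      persist(DeviationAdded(d)) { event =>
        deviations = event.description :: deviations
      }
  }

  // On start or failure the journal is replayed through this handler,
  // rebuilding the in-memory state from the immutable event log.
  override def receiveRecover: Receive = {
    case DeviationAdded(d) => deviations = d :: deviations
  }
}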
29. ● Akka 2.4
● Potentially infinite stream of data
● Ordered, replayable, resumable
● Aggregation, transformation, moving data
● EventsByPersistenceId
● AllPersistenceIds
● EventsByTag
30. val readJournal =
      PersistenceQuery(system).readJournalFor(CassandraJournal.Identifier)

    val source = readJournal.query(
        EventsByPersistenceId(UserPersistenceId(name).persistenceId, 0, Long.MaxValue), NoRefresh)
      .map(_.event)
      .collect { case s: EntireResistanceExerciseSession => s }
      .mapConcat(_.deviations)
      .filter(condition)
      .map(process)

    implicit val mat = ActorMaterializer()
    val result = source.runFold(List.empty[ExercisePlanDeviation])((x, y) => y :: x)
31. ● Potentially infinite stream of events
Source[Any].map(process).filter(condition)
[Diagram: Publisher → process → condition → Subscriber, with backpressure signalled upstream; a runnable sketch follows]
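As a rough illustration of demand-driven flow, the following sketch (assuming an Akka Streams 2.4 release where the throttle stage is available) runs a potentially infinite source against a deliberately slow consumer; demand propagates upstream, so the source only emits as fast as the slowest stage allows.

import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, ThrottleMode}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

implicit val system = ActorSystem("backpressure")
implicit val mat = ActorMaterializer()

Source(Stream.from(1))                              // unbounded publisher
  .map(_ * 2)                                       // stands in for "process"
  .filter(_ % 3 == 0)                               // stands in for "condition"
  .throttle(1, 100.millis, 1, ThrottleMode.Shaping) // deliberately slow subscriber
  .runWith(Sink.foreach(println))                   // demand flows upstream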
32. ● In Akka we have the read and write sides separated, in Cassandra we don't
● Different data model
● Avoid using operational datastore
● Eventual consistency
● Streaming transformations to a different format (see the sketch below)
● Unify journalled and other data
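A sketch of such a streaming transformation, reusing the readJournal from slide 30 and mirroring the experimental query API named on slide 29: tagged events are streamed out of the journal and written to a table partitioned for the read path. The tag, table name and event fields (exercise, time, userId) are assumptions for illustration.

import akka.stream.scaladsl.Sink
import com.datastax.driver.core.Cluster

val session = Cluster.builder()
  .addContactPoint("127.0.0.1").build().connect("read_side")
val insert = session.prepare(
  "INSERT INTO deviations_by_exercise (exercise, time, user_id) VALUES (?, ?, ?)")

readJournal.query(EventsByTag("deviation", 0L))
  .map(_.event)
  .collect { case d: ExercisePlanDeviation => d }
  .runWith(Sink.foreach { d =>
    // The read model is eventually consistent - it trails the journal.
    session.execute(insert.bind(d.exercise, d.time, d.userId))
  })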
33. ● Computations and analytics queries on the data
● Often iterative, complex, expensive computations
● Prepared and interactive queries
● Data from multiple sources, joins and transformations
● Often directly on a stream of data
● Whole history of events
● Historical behaviour
● Works retrospectively; can answer questions in the future that we don't yet know exist
● Various data types from various sources
● Large amounts of fast data
● Automated analytics
34. ● Cassandra 3.0 - user defined functions, functional indexes, aggregation functions, materialized views
● Server side denormalization
● Eventual consistency
● Copy of data with different partitioning (see the CQL sketch below)
[Diagram: the same data partitioned by userId on the base table and by performance in a materialized view]
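For example, given a com.datastax.driver.core.Session, a Cassandra 3.0 materialized view keeps a server side denormalised copy of a table under a different partition key; the table and column names here are illustrative.

// Server side denormalisation: same data, repartitioned by performance.
session.execute(
  """CREATE MATERIALIZED VIEW IF NOT EXISTS users_by_performance AS
     SELECT user_id, performance FROM users
     WHERE performance IS NOT NULL AND user_id IS NOT NULL
     PRIMARY KEY (performance, user_id)""")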
35. ● In memory dataflow distributed data processing framework, streaming and batch
● Distributes computation using a higher level API
● Load balancing
● Moves computation to data
● Fault tolerant
39. ● Cassandra can store
● Spark can process
● Gathering large amounts of heterogeneous data
● Queries
● Transformations
● Complex computations
● Machine learning, data mining, analytics
● Now possible with this stack
● Prepared and interactive queries
40. lazy val sparkConf: SparkConf = new SparkConf()
      .setAppName(...)
      .setMaster(...)
      .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(sparkConf)
val data = sc.cassandraTable[T]("keyspace", "table").select("columns")
val processedData = data.flatMap(...)...
processedData.saveToCassandra("keyspace", "table")
41. ● Akka Analytics project
● Handles custom Akka serialization
case class JournalKey(persistenceId: String, partition: Long, sequenceNr: Long)
lazy val sparkConf: SparkConf = new SparkConf()
  .setAppName(...)
  .setMaster(...)
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(sparkConf)
val events: RDD[(JournalKey, Any)] = sc.eventTable()
events.sortByKey().map(...).filter(...).collect().foreach(println)
42. ● Spark streaming
● Precomputing using Spark, or replication, often aiming for a different data model (see the sketch below)
[Diagram: an operational cluster feeding an analytics cluster via precomputation / replication, with integration of other data sources]
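A minimal sketch of the precomputation path, reusing the sparkConf from slide 40 and the Spark Cassandra connector's streaming support; the socket source, keyspace and table names are stand-ins.

import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector.streaming._

val ssc = new StreamingContext(sparkConf, Seconds(10))

ssc.socketTextStream("127.0.0.1", 9999)  // stand-in event source
  .map(exercise => (exercise, 1))        // one record per exercise name
  .reduceByKey(_ + _)                    // precompute aggregates per batch
  .saveToCassandra("analytics", "exercise_counts")

ssc.start()
ssc.awaitTermination()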
43. val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
    val exerciseDeviations = events
      .filterClass[EntireResistanceExerciseSession]
      .flatMap(_.deviations)

    val deviationsFrequency = sqlContext.sql(
      """SELECT planned.exercise, hour(time), COUNT(1)
         FROM exerciseDeviations
         WHERE planned.exercise = 'bench press'
         GROUP BY planned.exercise, hour(time)""")

    val deviationsFrequency2 = exerciseDeviationsDF
      .where(exerciseDeviationsDF("planned.exercise") === "bench press")
      .groupBy(
        exerciseDeviationsDF("planned.exercise"),
        exerciseDeviationsDF("time"))
      .count()

    val deviationsFrequency3 = exerciseDeviations
      .filter(_.planned.exercise == "bench press")
      .groupBy(d => (d.planned.exercise, d.time.getHours))
      .map(d => (d._1, d._2.size))
44. def toVector(user: User): mllib.linalg.Vector =
      Vectors.dense(user.frequency, user.performanceIndex, user.improvementIndex)

    val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
    val users: RDD[User] = events.filterClass[User]

    val kmeans = new KMeans()
      .setK(5)
      .set...

    val clusters = kmeans.run(users.map(toVector))
45. val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
    val ratings = events
      .filterClass[EntireResistanceExerciseSession]
      .flatMap(session =>
        session.sets.flatMap(set =>
          set.sets.map(exercise => (session.id.id, exercise.exercise))))
      .groupBy(e => e)
      .map(g =>
        Rating(normalize(g._1._1), normalize(g._1._2), normalize(g._2.size)))

    val model = new ALS().run(ratings)
    val predictions = model.predict(recommend)
[Table: sample user × exercise ratings matrix over bench press, bicep curl and dead lift for users 1-4]
46. val events = sc.eventTable().cache().toDF()
    val lr = new LinearRegression()
    val pipeline = new Pipeline().setStages(Array(
      new UserFilter(), new ZScoreNormalizer(), new IntensityFeatureExtractor(), lr))

    val paramGrid = new ParamGridBuilder()
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .addGrid(lr.fitIntercept, Array(true, false))
      .build()

    getEligibleUsers(events, sessionEndedBefore)
      .map { user =>
        val trainValidationSplit = new TrainValidationSplit()
          .setEstimator(pipeline)
          .setEvaluator(new RegressionEvaluator)
          .setEstimatorParamMaps(paramGrid)

        val model = trainValidationSplit.fit(
          events, ParamMap(ParamPair(userIdParam, user)))

        val testData = ... // Prepare test data.
        val predictions = model.transform(testData)
        submitResult(user, predictions, config)
      }
47. val events: RDD[(JournalKey, Any)] = sc.eventTable().cache()
    val connections = events.filterClass[Connections]

    val vertices: RDD[(VertexId, Long)] =
      connections.map(c => (c.id, 1L))

    val edges: RDD[Edge[Long]] = connections
      .flatMap(c => c.connections.map(Edge(c.id, _, 1L)))

    val graph = Graph(vertices, edges)
    val ranks = graph.pageRank(0.0001).vertices
53. ● Exercise domain as an example
● Analytics of both batch (offline) and streaming (online) data
● Analytics important in other areas (banking, stock market, network and cluster monitoring, business intelligence, commerce, internet of things, ...)
● Unlocking the value of data
54. ● Event sourcing
● CQRS
● Technologies to handle the data
○ Spark
○ Mesos
○ Akka
○ Cassandra
○ Kafka
● Handling data
● Insights and analytics enable value in data
56. ● Jobs at www.cakesolutions.net/careers
● Code at https://github.com/muvr
● Martin Zapletal @zapletal_martin
● Anirvan Chakraborty @anirvan_c