DC Spark bake off - Realtime TCP Packet Analysis using Spark and Azure Event ... | Silvio Fiorito
Slides for my entry to the DC Apache Spark Meetup, Spark Bake Off. I built a demo of a distributed, real-time TCP packet analysis system with Apache Spark, Azure Event Hubs, and Power BI.
JC Martin
Distinguished Architect
eBay
ONS2015: http://bit.ly/ons2015sd
ONS Inspire! Webinars: http://bit.ly/oiw-sd
Watch the talk (video) on ONS Content Archives: http://bit.ly/ons-archives-sd
Building and deploying an analytics service in the cloud is a challenge; maintaining that service is a bigger one. Users are gravitating toward a model where cluster instances are provisioned on the fly, used for analytics or other purposes, and then shut down when the jobs are done, so containers and container orchestration are more relevant than ever. In short, customers are looking for serverless Spark clusters. The intent of this presentation is to explain what serverless Spark is and the benefits of running Spark in a serverless manner.
Advanced Troubleshooting Techniques for your Application Stack Using MongoDB | Sumo Logic
In this session, our experts from MongoDB and Sumo Logic dive into how teams are solving visibility challenges at the intersection of DevOps, continuous delivery, and NoSQL databases by leveraging machine data analytics. We discuss:
- Basics of the MongoDB NoSQL database and the key use cases it supports
- Deep discussion of how Sumo Logic provides complete visibility into modern applications and MongoDB
- A hands-on demo of the Sumo Logic App for MongoDB that provides out-of-the-box MongoDB analytics
Event-driven Applications with Kafka, Micronaut, and AWS Lambda | Dave Klein,... | HostedbyConfluent
One of the great things about running applications in the cloud is that you only pay for the resources that you use. But that also makes it more important than ever for our applications to be resource-efficient. This becomes even more critical when we use serverless functions.
Micronaut is an application framework that provides dependency injection, developer productivity features, and excellent support for Apache Kafka. By performing dependency injection, AOP, and other productivity-enhancing magic at compile time, Micronaut allows us to build smaller, more efficient microservices and serverless functions.
In this session, we'll explore the ways that Apache Kafka and Micronaut work together to enable us to build fast, efficient, event-driven applications. Then we'll see it in action, using the AWS Lambda Sink Connector for Confluent Cloud.
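Micronaut's Kafka support is JVM-based, so as a hedged illustration of the underlying event-driven pattern only, here is the plain consume loop using the confluent-kafka Python client; the broker address, consumer group, and topic are placeholders, not anything from the talk.

```python
# Minimal event-consume loop with the confluent-kafka Python client.
# Micronaut itself is a JVM framework; this only sketches the loop its
# @KafkaListener support abstracts away. Broker/group/topic are made up.
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processor",        # hypothetical consumer group
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["orders"])            # hypothetical topic

try:
    while True:
        msg = consumer.poll(timeout=1.0)  # block up to 1s for an event
        if msg is None:
            continue
        if msg.error():
            print(f"consumer error: {msg.error()}")
            continue
        # React to the event; in Micronaut this would be a listener method.
        print(f"key={msg.key()}, value={msg.value().decode('utf-8')}")
finally:
    consumer.close()
```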
Bring your Graphite-compatible metrics into Sumo Logic | Sumo Logic
If you use the open source Graphite software to monitor mission-critical applications, you know the challenges of running, managing, and scaling it. Graphite may be fine to get started with, but it creates cost, complexity, and total-cost-of-ownership headaches as your environment scales.
Sumo Logic provides the industry’s first machine data analytics platform to natively ingest, index and analyze metrics and log data together in real-time.
In this webinar, we will show a live demo of how to:
Ingest Graphite-compatible metrics into the Sumo Logic service
Analyze and dashboard the metrics to get real-time insights
Correlate Graphite metrics and logs to troubleshoot issues faster
See how easy it is to migrate from Graphite to Sumo Logic (a minimal sender sketch follows this entry).
Webinar here: https://youtu.be/MEmFFwNmLxg
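For a sense of what "Graphite-compatible" means on the wire, here is a minimal Python sender using the Graphite plaintext protocol, which is just one "path value timestamp" line per metric over TCP; the collector host is a placeholder for wherever your own Graphite-format source is listening.

```python
# Minimal sketch of sending Graphite-format metrics over TCP.
# Host is a placeholder; 2003 is the conventional plaintext port.
import socket
import time

GRAPHITE_HOST = "collector.example.com"  # placeholder collector address
GRAPHITE_PORT = 2003                     # conventional Graphite plaintext port

def send_metric(path, value, timestamp=None):
    ts = int(timestamp if timestamp is not None else time.time())
    line = f"{path} {value} {ts}\n"      # plaintext protocol: path value ts
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))

send_metric("webapp.frontend.requests.count", 42)
```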
Sumo Logic "How To" Webinar - Monitoring you Data: Alerting on Outliers
Dashboards are fantastic, but how do I get notified of critical events? This webinar will cover how to create alerts that will allow your team to effectively monitor business-critical events. Alert channels include email or webhooks into Slack, PagerDuty, DataDog, ServiceNow, or any other webhook you want to develop. What about running custom scripts triggered from alerts? Let's do it.
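As a hedged sketch of the "custom scripts triggered from alerts" idea: a tiny Flask endpoint that accepts a webhook POST and shells out to a script. The JSON field names and the script are assumptions for illustration, not Sumo Logic's actual payload format.

```python
# Minimal sketch of a webhook receiver that triggers a custom script.
# The JSON field names below are assumptions; map them to whatever
# payload your alerting tool actually sends.
import subprocess
from flask import Flask, request

app = Flask(__name__)

@app.route("/alert", methods=["POST"])
def handle_alert():
    payload = request.get_json(force=True)
    name = payload.get("alert_name", "unknown")   # assumed field
    severity = payload.get("severity", "info")    # assumed field
    # Run a remediation/notification script with the alert details.
    subprocess.run(["./handle_alert.sh", name, severity], check=False)
    return {"status": "ok"}, 200

if __name__ == "__main__":
    app.run(port=8080)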
The document discusses the evolution of the Apache Spark and Hadoop ecosystems, highlighting how new use cases in genomics, physics, and healthcare have emerged. It also introduces Livy, a new open source REST service for Apache Spark that allows submitting Spark jobs from web and mobile apps without needing a Spark client, and provides multi-tenancy and fault tolerance to support multiple users reliably.
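To make the Livy model concrete, here is a minimal sketch of submitting and polling a batch job through Livy's documented REST API (POST /batches, GET /batches/{id}); the server URL, jar path, and class name are placeholders.

```python
# Sketch of submitting a Spark job through Livy's REST API and polling
# until it reaches a terminal state. URL/jar/class are placeholders.
import time
import requests

LIVY = "http://livy.example.com:8998"          # hypothetical Livy server

resp = requests.post(f"{LIVY}/batches", json={
    "file": "hdfs:///jobs/my-spark-job.jar",   # placeholder artifact
    "className": "com.example.MySparkJob",     # placeholder main class
    "args": ["2024-01-01"],
})
batch = resp.json()

while True:
    state = requests.get(f"{LIVY}/batches/{batch['id']}").json()["state"]
    print("state:", state)
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)
```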
SignalFx is an advanced monitoring and alerting system for cloud applications delivered as SaaS. It provides real-time metrics, analytics, and tagging to monitor microservices architectures. Traditional monitoring approaches are noisy and reactive, while SignalFx aims to provide guided triage and correlate events using time series analytics to identify patterns and anomalies.
This document describes a location-based push notification solution for mobile apps using ArcGIS Online and Parse. The solution allows users to subscribe to notification channels based on geographic area and attribute filters. When new seismic events occur within subscribed areas and meet attribute criteria, a Python script checks the events against user subscriptions and sends push notifications using the Parse API. The architecture is serverless and scalable. The document provides an overview of how each component is used and other potential use cases for the approach.
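A minimal sketch of the matching-and-push step under stated assumptions: the subscription structure and event schema are invented for illustration, and the endpoint and headers reflect the historical hosted Parse REST API, not necessarily the deployment described.

```python
# Sketch: filter a new seismic event against a subscription, then push
# via the (now-retired) hosted Parse REST API. Keys/schemas are illustrative.
import requests

PARSE_URL = "https://api.parse.com/1/push"     # historical hosted endpoint
HEADERS = {
    "X-Parse-Application-Id": "APP_ID",        # placeholder
    "X-Parse-REST-API-Key": "REST_KEY",        # placeholder
    "Content-Type": "application/json",
}

subscription = {"channel": "bay_area_quakes", "min_magnitude": 4.0,
                "bbox": (-123.0, 36.9, -121.0, 38.9)}  # minlon, minlat, maxlon, maxlat

def matches(event, sub):
    minlon, minlat, maxlon, maxlat = sub["bbox"]
    in_area = minlon <= event["lon"] <= maxlon and minlat <= event["lat"] <= maxlat
    return in_area and event["magnitude"] >= sub["min_magnitude"]

event = {"lon": -122.4, "lat": 37.8, "magnitude": 4.6}
if matches(event, subscription):
    requests.post(PARSE_URL, headers=HEADERS, json={
        "channels": [subscription["channel"]],
        "data": {"alert": f"M{event['magnitude']} event in your area"},
    })
```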
Confluent On Azure: Why you should add Confluent to your Azure toolkit | Alic... | HostedbyConfluent
As a data professional, you are the glue that makes cross-platform integrations possible. With the increase in adoption of hybrid cloud architectures, Kafka is an increasingly relevant tool for building data pipelines between platforms and accelerating delivery on cloud projects. Early exposure to Kafka on Azure capabilities gives you an edge to build better mousetraps at the design phase.
Customers already running Kafka on premises and looking to extend their Kafka systems to Azure can get started quickly with Confluent Cloud. Additionally, DevOps for self-managed options can be scaled easily with Ansible for virtual machines, or with containers via Azure Kubernetes Service or Azure Container Instances.
This session is presented from the Microsoft Solution Architect perspective by Israel Ekpo, Microsoft Cloud Solution Architect and Alicia Moniz, Microsoft MVP. They will cover use cases and scenarios, along with key Azure integration points and architecture patterns.
This document provides lessons learned from optimizing Apache Spark for NoSQL databases like Riak. Some key lessons include:
1. Parallelizing operations whenever possible to avoid overloading Riak with too many direct key-based gets or secondary index queries (see the sketch after this list).
2. Being smart about data mapping between NoSQL data structures and Spark DataFrames/RDDs for efficient processing.
3. Optimizing performance at all levels from the network protocol to data locality optimizations.
4. Being flexible in supporting multiple languages and deployment environments for Spark and NoSQL integrations.
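A sketch of the first lesson in PySpark: fan key lookups out across partitions so each executor opens one client and batches its own gets, rather than hammering the store with per-key connections from the driver. The riak calls mirror the shape of Basho's Python client but should be treated as illustrative, not as the talk's actual code.

```python
# Sketch: partition-parallel key lookups against a Riak-style store.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("riak-parallel-gets").getOrCreate()
keys = spark.sparkContext.parallelize([f"user:{i}" for i in range(100_000)], 32)

def fetch_partition(key_iter):
    import riak                             # illustrative client import
    client = riak.RiakClient(pb_port=8087)  # one connection per partition
    bucket = client.bucket("users")
    for key in key_iter:
        obj = bucket.get(key)
        if obj.exists:
            yield key, obj.data

results = keys.mapPartitions(fetch_partition)
print(results.take(5))
```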
AI on Spark for Malware Analysis and Anomalous Threat Detection | Databricks
At Avast, we believe everyone has the right to be safe. We are dedicated to creating a world that provides safety and privacy for all, no matter where you are, who you are, or how you connect. With over 1.5 billion attacks stopped and 30 million new executable files monthly, big data pipelines are crucial for the security of our customers. At Avast we are leveraging Apache Spark machine learning libraries and TensorFlowOnSpark for a variety of tasks ranging from marketing and advertisement, through network security, to malware detection. This talk will cover our main cybersecurity use cases for Spark. After describing our cluster environment, we will first demonstrate anomaly detection on time series of threats. With thousands of types of attacks and malware, AI helps human analysts select and focus on the most urgent or dire threats. We will walk through our setup, from distributed training of deep neural networks with TensorFlow to deploying and monitoring a streaming anomaly detection application with the trained model. Next we will show how we use Spark for analysis and clustering of malicious files and for large-scale experimentation to automatically process and handle changes in malware. In the end, we will give a comparison to other tools we used for solving those problems.
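As a toy illustration of anomaly detection on a threat time series (a stand-in for the idea, not Avast's production pipeline), a rolling z-score flags counts that deviate sharply from the preceding window:

```python
# Toy sketch: flag threat counts whose z-score against the preceding
# rolling window exceeds a threshold. Illustrative only.
import pandas as pd

counts = pd.Series([120, 118, 125, 122, 119, 640, 121, 117, 124, 123],
                   name="threats_per_minute")

window = 5
mean = counts.rolling(window).mean().shift(1)  # stats of the *previous* window
std = counts.rolling(window).std().shift(1)
zscore = (counts - mean) / std

print(counts[zscore.abs() > 3])   # flags the spike at 640
```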
How we Auto Scale applications based on CPU with Kubernetes at M6Web? | Vincent Gallissot
I explain how to use resource requests and the Horizontal Pod Autoscaler to autoscale an application, with a YAML example from our geolocation app, at https://tech.m6web.fr/
This talk was given at our Last Friday Talk, Oct. 18.
Questions & Answers:
Q1: Is it relevant to put high values on the requests?
The requests value is taken into account when triggering the HPA.
If the app consumes a lot of resources, then yes;
if it consumes little, autoscaling will be triggered late, or not at all (the app will crash first).
In all cases, the application must hold the load beyond the requests value: it can consume more.
Q2: Is it relevant to have a very high HPA max?
Yes, if the app can consume those resources under normal circumstances.
On the other hand, an HPA max at 1000 times the application's maximum is of little interest;
it's more of a safeguard in case a bug ever makes the app consume too much.
Q3: Are custom metrics defined at the request level?
No. Requests cover CPU and RAM, notions defined at the level of the app's containers.
Metrics, custom or not, are used to define the HPA target, so they are defined at the HPA level.
Q4: What is the price of a very high HPA max?
None: those resources are not reserved until the pods are launched,
so it costs nothing; it's just protection.
Q5: How long does it take to launch an additional node?
It depends on the cloud provider.
At AWS, for the moment, it's between 3 and 5 minutes,
so it's not instantaneous, and that can be a problem under very sharp load peaks (we're looking at overprovisioning).
Q6: How long does it take to scale up pods?
A few seconds: we start new containers, which are created very quickly.
We use Docker containers for the moment, but Kubernetes is not restricted to them.
Q7: Can we scale on a metric's history?
Not really. We scale according to a metric's current values.
The purpose of Kubernetes is to have an infrastructure that scales automatically with the current load;
predicting load is not among its objectives.
However, it can still be done, depending on the Prometheus query we make. (A minimal HPA example follows below.)
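For reference, here is a hedged sketch of the kind of CPU-based HPA discussed above, created with the official Kubernetes Python client rather than the talk's YAML; the deployment name, replica bounds, and CPU target are placeholders. Note that CPU-based autoscaling only works if the pod's containers declare CPU requests (Q1/Q3).

```python
# Sketch: create a CPU-based HorizontalPodAutoscaler with the official
# kubernetes Python client (equivalent in spirit to the talk's YAML).
from kubernetes import client, config

config.load_kube_config()                      # or load_incluster_config()
autoscaling = client.AutoscalingV1Api()

hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="geolocation-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="geolocation"),
        min_replicas=2,
        max_replicas=20,                       # safeguard, not a reservation (Q4)
        target_cpu_utilization_percentage=80,  # percent of the CPU *request* (Q1)
    ),
)
autoscaling.create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```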
This document summarizes 5 papers related to big data architecture and deep learning. Paper 1 discusses the Lambda architecture for balancing real-time and batch data processing. Paper 2 introduces Delta Lake for efficient ACID-compliant storage over object stores. Paper 3 proposes the Lakehouse architecture which unifies data warehousing and analytics using Delta Lake. Paper 4 presents the Conformer model that combines transformers and convolutions for speech recognition. The last paper applies intent detection and slot filling to Vietnamese text using BERT. These papers are relevant to the author's graduation thesis on traffic prediction using speech data analysis.
Michelle Casbon gave a presentation on how Idibon uses Apache Spark for natural language processing tasks in a distributed cloud-based infrastructure. Idibon builds machine learning models that can analyze text in any language. They use Spark for feature extraction, training models like logistic regression and SVM, and making predictions on unseen data. Spark allows them to do these NLP tasks at scale. Idibon also developed a persistence layer that allows them to operationalize many models and integrate Spark functionality into their existing systems.
How to Define and Share your Event APIs using AsyncAPI and Event API Products... | HostedbyConfluent
Defining asynchronous APIs and sharing them with your developer community is the most effective way for internal app developers and partners to create new services using real-time event streams. But how do you do it? What specification do you use to define the APIs? What are the best practices for sharing them with the developer community? What framework can you use to code? And what's next? How do you manage the lifecycle of these APIs? In this talk, Fran Mendez, founder of AsyncAPI, and Jonathan Schabowsky, CTO Architect at Solace, will introduce you to the AsyncAPI specification and show you two different methods to define and share your event APIs, quickly get up to speed, and more. You will learn how to create a Kafka application using asynchronous APIs in minutes!
Monitoring involves analyzing infrastructure issues and failures in both virtual and physical systems. As virtual systems increased, the number of systems needing monitoring also increased significantly. Monitoring is a key part of approaches like DevOps and SRE that focus on system reliability. It involves collecting metrics in real-time, logs about events and activities, and curated alerts based on metrics and logs within their proper context. Tools like Cloudwatch, Riemann, DataDog, Loggly, Splunk, ELK, and Bosun can help bring these facets of monitoring together.
Simplifying Big Data Applications with Apache Spark 2.0 | Spark Summit
Apache Spark 2.0 is a major new release that simplifies the Spark API and improves performance. Some key points:
1) It remains highly compatible with Spark 1.x while building on lessons learned to simplify the API with over 2000 patches from 280 contributors.
2) It introduces structured APIs like DataFrames that allow Spark to optimize queries via whole-stage code generation, providing up to 10x performance gains.
3) It launches a new higher-level streaming API called Structured Streaming that allows developers to write streaming jobs that behave like batch jobs and integrate easily with static data and batch jobs.
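As a concrete illustration of point 3, here is the canonical Structured Streaming word count in PySpark: the streaming query is written like a batch aggregation, and Spark runs it incrementally. The socket source on port 9999 is just the standard demo setup.

```python
# Minimal Structured Streaming sketch: a streaming word count that
# reads lines from a socket and behaves like an incremental batch job.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("structured-wordcount").getOrCreate()

lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

query = (counts.writeStream.outputMode("complete")
         .format("console").start())
query.awaitTermination()
```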
QuickStart your Sumo Logic service with this exclusive webinar. At these monthly live events you will learn how to capitalize on critical capabilities that can amplify your log analytics and monitoring experience while providing you with meaningful business and IT insights.
https://www.sumologic.com/online-training/#start
One Click Streaming Data Pipelines & Flows | Leveraging Kafka & Spark | Ido F... | HostedbyConfluent
The Apache Kafka ecosystem is very rich with components and pieces that make for designing and implementing secure, efficient, fault-tolerant and scalable event stream processing (ESP) systems. Using real-world examples, this talk covers why Apache Kafka is an excellent choice for cloud-native and hybrid architectures, how to go about designing, implementing and maintaining ESP systems, best practices and patterns for migrating to the cloud or hybrid configurations, when to go with PaaS or IaaS, what options are available for running Kafka in cloud or hybrid environments and what you need to build and maintain successful ESP systems that are secure, performant, reliable, highly-available and scalable.
Digital Transformation & Solvency II Simulations for L&G: Optimizing, Acceler... | OW2
Legal & General, a 181-year old financial services group, selected ActiveEon ProActive to modernize their Solvency II risk calculation simulations and help migrate them from their private datacenter to Microsoft Azure cloud. ProActive would optimize and accelerate the simulations, running tasks in parallel across cloud resources. Using ProActive's workflows and scheduling, the end-to-end simulation time was reduced from 18 hours to 5 hours, and high priority reports became available 5 hours sooner. Monitoring ensured tasks continued running even if cloud resources failed.
This document proposes a solution to streamline the database monitoring workflow by removing manual steps and integrating tools. Currently, alerts are sent over email requiring manual lookups of host details and comparisons to ignore lists. The proposed solution is to configure the monitoring tool to push alerts to a script that processes them along with inventory data to generate a web dashboard. The dashboard would group alerts and allow one-click access to production databases eliminating manual SSH sessions and menu navigation. Benefits include a task-focused interface, no data copying/pasting, and potential to integrate with configuration tools.
Building adaptive user experiences using Contextual Multi-Armed Bandits with... | HostedbyConfluent
At Expedia Group, providing a customized experience for travellers is key to unlocking the best possibilities for each individual traveller and each type of trip. Contextual multi-armed bandits provide a natural approach to personalizing the user experience and improving content relevancy. In this talk, we present the end-to-end scalable system developed to democratize the use of contextual bandits at EG. The architecture comprises an online inference component as well as a continuous feedback loop that tracks users' affinity towards certain content or page layouts. Kafka is the backbone of our system, powering high-performance streaming jobs that provide bandits with real-time feedback signals to learn from over time. We describe our experience using Kafka for user interaction events and bandit feedback messages at scale. Lastly, we look at how we plan to expand our use of Kafka to build an off-policy evaluation framework to evaluate the effectiveness of new algorithms.
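An illustrative sketch of such a feedback loop (not Expedia's actual system): an epsilon-greedy contextual bandit whose per-context reward estimates are updated from a Kafka feedback topic. The topic name and message schema are assumptions.

```python
# Illustrative epsilon-greedy contextual bandit fed by Kafka feedback.
import json
import random
from collections import defaultdict
from kafka import KafkaConsumer

EPSILON = 0.1
ARMS = ["layout_a", "layout_b", "layout_c"]
stats = defaultdict(lambda: [0, 0.0])   # (context, arm) -> [pulls, mean reward]

def choose_arm(context):
    if random.random() < EPSILON:
        return random.choice(ARMS)                          # explore
    return max(ARMS, key=lambda a: stats[(context, a)][1])  # exploit

def update(context, arm, reward):
    n, mean = stats[(context, arm)]
    stats[(context, arm)] = [n + 1, mean + (reward - mean) / (n + 1)]

consumer = KafkaConsumer("bandit-feedback",                 # assumed topic
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v))
for msg in consumer:
    fb = msg.value   # assumed schema: {"context":..., "arm":..., "reward":...}
    update(fb["context"], fb["arm"], fb["reward"])
```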
Asynchronous Hyperparameter Optimization with Apache Spark | Databricks
For the past two years, the open-source Hopsworks platform has used Spark to distribute hyperparameter optimization tasks for machine learning. Hopsworks provides some basic optimizers (grid search, random search, differential evolution) to propose combinations of hyperparameters (trials) that are run synchronously in parallel on executors as map functions. However, many such trials perform poorly, and we waste a lot of CPU and hardware accelerator cycles on trials that could be stopped early, freeing up the resources for other trials.
In this talk, we present our work on Maggy, an open-source asynchronous hyperparameter optimization framework built on Spark that transparently schedules and manages hyperparameter trials, increasing resource utilization and massively increasing the number of trials that can be performed in a given period of time on a fixed amount of resources. Maggy is also used to support parallel ablation studies using Spark. We have commercial users evaluating Maggy, and we will report on the gains they have seen in reduced time to find good hyperparameters and improved utilization of GPU hardware. Finally, we will perform a live demo in a Jupyter notebook, showing how to integrate Maggy into existing PySpark applications.
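The core asynchronous idea can be sketched without Maggy's own API (which isn't reproduced here): trials run in a worker pool, and each completed trial immediately frees a slot for a new one instead of waiting on the slowest member of a synchronous batch. The objective function below is a fake stand-in for real training.

```python
# Sketch of asynchronous random search: finished trials immediately
# refill their worker slot rather than blocking on a synchronous batch.
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_trial(lr, dropout):
    # Placeholder for real training; returns a fake validation score.
    return 1.0 - abs(lr - 0.01) - abs(dropout - 0.3) + random.gauss(0, 0.01)

def sample():
    return {"lr": 10 ** random.uniform(-4, -1), "dropout": random.uniform(0.0, 0.6)}

MAX_TRIALS, WORKERS = 50, 4
best_score = float("-inf")
with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    pending = {pool.submit(run_trial, **sample()) for _ in range(WORKERS)}
    launched = WORKERS
    while pending:
        done = next(as_completed(pending))   # whichever trial finishes first
        pending.remove(done)
        best_score = max(best_score, done.result())
        if launched < MAX_TRIALS:            # immediately refill the free slot
            pending.add(pool.submit(run_trial, **sample()))
            launched += 1
print("best score:", best_score)
```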
Stream processing IoT time series data with Kafka & InfluxDB | Al Sargent, In... | HostedbyConfluent
Time series data is everywhere -- connected IoT devices, application monitoring & observability platforms, and more. What makes time series data streams challenging is that they often have orders of magnitude more data than other workloads, with millions of time series datapoints being quite common. Given its ability to ingest high volumes of data, Kafka is a natural part of any data architecture handling large volumes of time series telemetry, specifically as an intermediate buffer before that data is persisted in InfluxDB for processing, analysis, and use in other applications. In this session, we will show you how you can stream time series data to your IoT application using Kafka queues and InfluxDB, drawing upon deployments done at Hulu and Wayfair that allow both to ingest 1 million metrics per second. Once this session is complete, you'll be able to connect a Kafka queue to an InfluxDB instance as the beginning of your own time series data pipeline.
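A minimal sketch of such a pipeline, assuming the InfluxDB 1.x Python client and kafka-python; the topic name and message schema are invented for illustration.

```python
# Sketch: drain a Kafka topic and persist points to InfluxDB in batches.
import json
from kafka import KafkaConsumer
from influxdb import InfluxDBClient

consumer = KafkaConsumer("device-telemetry",               # assumed topic
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda v: json.loads(v))
influx = InfluxDBClient(host="localhost", port=8086, database="iot")

batch = []
for msg in consumer:
    reading = msg.value   # assumed: {"device": ..., "temp_c": ..., "ts": ...}
    batch.append({
        "measurement": "temperature",
        "tags": {"device": reading["device"]},
        "time": reading["ts"],
        "fields": {"temp_c": float(reading["temp_c"])},
    })
    if len(batch) >= 500:                  # write in batches, not per point
        influx.write_points(batch)
        batch.clear()
```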
Spark + AI Summit 2019: Apache Spark Listeners: A Crash Course in Fast, Easy ... | Landon Robinson
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
Spark Development Lifecycle at Workday - ApacheCon 2020 | Pavel Hardak
Presented by Eren Avsarogullari and Pavel Hardak (ApacheCon 2020)
https://www.linkedin.com/in/erenavsarogullari/
https://www.linkedin.com/in/pavelhardak/
Apache Spark is the backbone of Workday's Prism Analytics Platform, supporting data processing use cases such as Data Ingestion, Preparation (Cleaning, Transformation & Publishing), and Discovery. At Workday, we extend the Spark OSS repo and build custom Spark releases layering our custom patches on top of the OSS patches. Custom Spark release development introduces challenges when supporting multiple Spark versions from a single repo and dealing with large numbers of customers, each of which can run their own long-running Spark applications. When building custom Spark releases and new Spark features, a dedicated benchmark pipeline is also important to catch performance regressions, by running the standard TPC-H and TPC-DS queries against both Spark versions and monitoring the Spark driver's and executors' runtime behavior before production. At the deployment phase, we also follow a progressive roll-out plan backed by Feature Toggles that enable or disable new Spark features at runtime. As part of our development lifecycle, Feature Toggles help with use cases such as selecting Spark compile-time and runtime versions, running test pipelines against both Spark versions on the build pipeline, and supporting progressive roll-out when dealing with large numbers of customers and long-running Spark applications. The operation-level runtime behavior of executed Spark queries is also important for debugging and troubleshooting. The upcoming Spark release introduces a new SQL REST API exposing operation-level runtime metrics for executed queries, and we transform these into queryable Hive tables to track operation-level runtime behavior per executed query. In light of this, this session covers the Spark feature development lifecycle at Workday: the custom Spark upgrade model, the benchmark & monitoring pipeline, and the Spark runtime metrics pipeline, walking through the patterns and technologies used step by step.
Apache Spark Development Lifecycle @ Workday - ApacheCon 2020 | Eren Avşaroğulları
Workday uses Apache Spark as the foundational technology for its Prism Analytics product. It has developed a custom Spark upgrade model to handle upgrading Spark across its multi-tenant environment. Workday also collects runtime metrics on Spark SQL queries using a custom metrics pipeline and REST API. Future plans include upgrading to Spark 3.x and improving multi-tenancy support through a "Multiverse" deployment model.
Apache Spark Listeners: A Crash Course in Fast, Easy Monitoring | Databricks
The Spark Listener interface provides a fast, simple and efficient route to monitoring and observing your Spark application - and you can start using it in minutes. In this talk, we'll introduce the Spark Listener interfaces available in core and streaming applications, and show a few ways in which they've changed our world for the better at SpotX. If you're looking for a "Eureka!" moment in monitoring or tracking of your Spark apps, look no further than Spark Listeners and this talk!
- Vinay Mittal is an IT professional with over 10 years of experience in C++ development. He currently works as a Computer Scientist at Adobe India.
- His skills include C/C++, Perl, Unix shell scripting, Javascript, AWS services, SQL databases, version control systems, and UNIX/Linux systems.
- Previous experience includes developing multi-threaded C++ applications at RBS and security applications at CA. At Amazon he worked on product ads and billing systems.
- Education includes a Masters in Computer Science from IIT Roorkee with honors.
Broadcast Music Inc - Release Automation Rockstars! | ghodgkinson
The document describes Broadcast Music Inc.'s automation of their software release process using IBM Rational tools. It discusses:
1. BMI's goals for automated release management including assembly, deployment, rollback, and redeployment.
2. How different IBM Rational tools like Team Concert, Quality Manager, and Build Forge are used to automate builds, testing, and releases of various BMI systems like WebSphere, Portal, and DataPower.
3. The technical details of setting up automated builds and deployments using Ant scripts for various components, promoting changes between environments, and storing assembled artifacts.
The document discusses the problems solved at MyMusicTaste, including multi-region deployment, microservices architecture, and single-page applications. It introduces the speaker and their background with Mathpresso and MyMusicTaste, then covers how they implemented multi-region databases, service orchestration, CI/CD pipelines, secret management, and server-side rendering for SEO with Lambda@Edge and CloudFront. The conclusion discusses further scaling of teams and solutions.
This document provides a summary of Aleksandr Savelyev's experience as a Senior Software Engineer including his skills, projects, and employment history. He has over 15 years of experience developing web and desktop applications using technologies like C#, Java, .NET, SQL, and more. Notable projects include rewriting a desktop application to be web-based, developing services to transfer video files between servers, and building monitoring applications. He was previously employed at CCH Wolters Kluwer and DGFastChannel where he worked on various projects delivering digital video advertisements.
This document discusses big data solutions in the cloud. It describes Hadoop, MapReduce, and NoSQL databases for storing and analyzing large datasets. It also discusses AWS services like S3, EMR, Redshift, DynamoDB, and Kinesis that can be used to build scalable big data architectures in the cloud. Examples are provided showing how these AWS services can be used together to perform log analysis, recommendations, and streaming analytics on big data.
Kumar Ramaswamy provides services related to developing highly scalable and secure distributed systems using technologies like PostgreSQL, HDFS, Spark and Kafka. He has extensive experience architecting fault tolerant systems and has developed distributed systems for tasks like product tracking and advertisement data processing. His background includes work with technologies such as Unix, Java, distributed databases and big data platforms.
"The Suitcase" Project Cloud QTR meeting presentation @ Disney/ABCETCenter
The document outlines several test cases for developing and evaluating a next-generation cloud-based media production workflow and framework called C4. The test cases cover areas like ingesting and transcoding media in the cloud, using cloud storage and orchestration software, developing metadata schemas, and capturing and streaming 360-degree video. They also describe proposed virtual reality enhancements like live streaming footage from a 360 camera on set or projecting 360 video onto a dome for audiences to view with augmented reality.
batbern43 Events - Lessons learnt building an Enterprise Data Bus | BATbern
Swissport is the world's leading provider of ground handling and cargo services, serving over 300 airports in 50 countries. Data plays a leading role: when and where does an aircraft take off, how long did refueling take, which pieces of luggage need to be unloaded? Due to growth through acquisitions, the IT landscape is characterized by a multitude of silos, which makes cross-cutting analytics and the use of data in new contexts difficult. At the same time, different IT governance models are followed around the world, leading to inconsistencies in processes, data access, and data quality. This situation is being addressed through a vision for an event-driven architecture and its anchoring in management processes, principles for realizing the vision, and their implementation: building an enterprise data model and governance for accessing and documenting data. The presentation reflects on experiences from these steps. Notably, the build-out of the underlying platform was under strict cost control, and only ten two-week sprints were available for the first production release.
This document discusses analytics at the edge in Internet of Things environments. It provides an overview of edge computing and examples of edge devices. It then introduces Apache Edgent (formerly Quarks), an open source programming model and runtime for streaming analytics at the edge. The document also discusses using the Informix database for analytics on sensor data both at the edge and in the cloud, and it demonstrates connecting Edgent to Informix on a Raspberry Pi for real-time sensor data analysis.
Data Streaming with Apache Kafka & MongoDB | confluent
Explore the use-cases and architecture for Apache Kafka, and how it integrates with MongoDB to build sophisticated data-driven applications that exploit new sources of data.
If you're like most of the world, you're in an aggressive race to implement machine learning applications and on a path to deep learning. If you can give better service at a lower cost, you will be the winners in 2030. But infrastructure is a key challenge to getting there. What does the technology infrastructure look like over the next decade as you move from petabytes to exabytes? How are you budgeting for more colossal data growth over the next decade? How do your data scientists share data today, and will it scale for 5-10 years? Do you have the appropriate security, governance, back-up, and archiving processes in place? This session will address these issues and discuss strategies for customers as they ramp up their AI journey with a long-term view.
6. DISZ - Scalability of Web Applications on the Google Cloud Platform | Márton Kodok
The talk covers how to build a flexible, highly scalable service on cloud providers' platforms. How do you ensure that a service which at launch only needs to serve a few dozen or a few hundred users can elastically serve thousands of users, or orders of magnitude more? Sit back and admire the autoscaling feature on Black Friday. We will talk about virtualization, platform-level virtualization, and superlight application containers, with near-real-time "shuffling" of workloads. Many components of the Google Cloud Platform will be presented. Banks, insurers, webshops, and more all see the cloud as their breakout point.
This document provides an overview of cloud native monitoring with Prometheus. It discusses Prometheus and how it has become the standard for metrics-based monitoring. It covers monitoring systems and applications with Prometheus, including scraping metrics, querying, and instrumenting applications to expose metrics. It also discusses alerting with Alertmanager and scaling Prometheus through federation and projects like Thanos. The document aims to explain how Prometheus enables observability of systems in cloud native environments and the growing ecosystem around Prometheus.
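As a small illustration of the instrumentation side described above, here is a minimal Python app exposing metrics with the official prometheus_client library; the metric names and the simulated workload are placeholders.

```python
# Minimal sketch of instrumenting an app with prometheus_client:
# expose /metrics and record a counter plus a latency histogram.
import random
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("app_requests_total", "Total requests handled")
LATENCY = Histogram("app_request_seconds", "Request latency in seconds")

@LATENCY.time()
def handle_request():
    REQUESTS.inc()
    time.sleep(random.uniform(0.01, 0.1))   # stand-in for real work

if __name__ == "__main__":
    start_http_server(8000)   # Prometheus scrapes http://localhost:8000/metrics
    while True:
        handle_request()
```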
Streaming data analytics (Kinesis, EMR/Spark) - Pop-up Loft Tel Aviv | Amazon Web Services
"Low latency analytics is becoming a very popular scenario. In this session we will discuss several architectural options for doing
analytics on moving data using Amazon Kinesis and EMR/Spark Streaming and share some best practices and real world examples."
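As a minimal sketch of the producing side of such a pipeline, here is a boto3 producer putting JSON records onto a Kinesis stream for downstream Spark Streaming consumers; the stream name, region, and record schema are placeholders.

```python
# Sketch: put JSON events onto an Amazon Kinesis stream with boto3.
import json
import time
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

for i in range(10):
    event = {"user_id": i % 3, "action": "click", "ts": time.time()}
    kinesis.put_record(
        StreamName="clickstream",             # placeholder stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event["user_id"]),   # shards records by user
    )
```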
What companies hiring data scientists and Hadoop developers are looking for? | DeZyre
Companies in many industries, including oil and gas, insurance, social media, and government, are hiring data scientists and Hadoop developers. The document provides strategies for job seekers to demonstrate their qualifications to hiring managers, including illustrating how they can perform required tasks without extensive years of experience. It also outlines what interviewers look for, such as business acumen for data scientists. Salary ranges are provided for various big data roles from entry-level to experienced positions. Contact information is included at the end for follow up.
This document promotes additional reading on big data and Hadoop training by providing clickable links to read a complete article on the topic as well as learn more about big data and Hadoop training opportunities. It points the reader towards further resources without providing much summary or context of its own.
This document discusses how programming is essential for data science work. It explains that while data science builds on statistics, it now requires a diverse set of skills including programming. Programming is needed for tasks like data wrangling, analysis, modeling, deployment, and more. The document recommends Python or R as good options for the programming component of data science and provides examples of how programming supports functions like data exploration, modeling, building production systems, and more. Overall, it argues that programming proficiency is a core requirement for modern data science work.
This document discusses big data and Hadoop training. It provides links to read a complete article on 5 big data use cases and to learn more about IBM Certified big data and Hadoop training. Clicking the links would take the reader to more information on common big data uses and certification programs.
Average salaries for big data and Hadoop developers have increased 9.3% in the last year, now ranging from $119,250 to $168,250 annually. There are over 500 open big data jobs in San Francisco, where the average salary for Hadoop developers is $139,000, and senior Hadoop developers can earn over $178,000. The states with the most big data and Hadoop jobs are California, New York, New Jersey, and Texas.
This document provides guidance on becoming a data scientist by outlining important skills to learn like statistics, programming, visualization, and big data concepts. It recommends starting with hands-on SQL and statistical learning in R or Python, developing expertise in data visualization, and learning to apply techniques such as regression, classification, and recommendation engines. The document advises demonstrating what you've learned by applying for data scientist positions.
This document discusses how big data is transforming business intelligence. It outlines some of the pains of traditional BI, including maintaining large data warehouses and only considering structured data. The document advocates for an open source approach using Hadoop as an "extended data warehouse" to address these issues. Examples of recent Solocal Group projects involving real-time business analytics and a search power selector are provided. Advice is given on how companies can activate big data projects and start the BI transformation.
Big Data analytics is revolutionizing the sports industry by helping teams and players analyze massive amounts of data to improve performance, prevent injuries, and enhance the fan experience. Sports teams are collecting data from cameras, sensors, wearables and other sources to analyze player performance, predict outcomes, and develop strategies. This data combined with analytics allows teams to gain competitive advantages and fans to more accurately predict winners. While big data provides insights, human experience and instincts are still needed to apply the strategies during games.
Big data solutions are enabling healthcare providers to transform into more patient-centered, collaborative care models driven by analytics. As basic needs are met and advanced applications emerge, new use cases will arise from sources like wearable devices and sensors. Predictive analytics using big data can help fill gaps by predicting things like missed appointments, noncompliance, and patient trajectories in order to proactively manage care. However, barriers to using big data include a lack of expertise and the fact that big data is largely unstructured, unlike the data held in traditional databases.
Big data refers to extremely large data sets that are difficult to process using traditional data processing applications. Hadoop is an open-source software framework that stores and processes big data for analytics using a distributed computing architecture. Demand for big data skills like Hadoop development and administration is increasing significantly, with salaries offering healthy premiums, as more organizations use big data analytics to make important predictions. DeZyre offers job-skills training courses developed jointly with industry partners, delivered through an interactive online platform, to help people learn skills like Hadoop from experts and get certified.
25 things that make Amazon's Jeff Bezos, Jeff BezosDeZyre
Based on the book "The Everything Store" by Brad Stone, this presentation walks you through some of Jeff Bezos's personality traits that made Amazon.com successful.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...Social Samosa
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present on related topics such as vector databases, LLMs, and managing data at scale. The intended audience of this group includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup, and is sponsored by Zilliz, maintainers of Milvus.
Analysis insight about a Flyball dog competition team's performanceroli9797
Insights from my analysis of a Flyball dog competition team's performance over the last year. Find more: https://github.com/rolandnagy-ds/flyball_race_analysis/tree/main
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps as just another data type. However, when performing real-time analytics, timestamps should be first-class citizens, and we need rich time semantics to get the most out of our data. We also need to deal with ever-growing datasets while staying performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
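To make the "rich time semantics" point concrete, here is a hedged sketch of querying QuestDB from Python over its PostgreSQL wire protocol (port 8812 by default); the table and column names are hypothetical, and SAMPLE BY is QuestDB's time-series aggregation clause.

```python
# Query QuestDB via its PostgreSQL wire protocol with psycopg2.
# Table `sensors` and column names are hypothetical; `ts` is assumed to be
# the table's designated timestamp column, which SAMPLE BY requires.
import psycopg2

conn = psycopg2.connect(
    host="localhost", port=8812, user="admin", password="quest", dbname="qdb"
)
with conn.cursor() as cur:
    # Average sensor value per hour, bucketed by the designated timestamp.
    cur.execute("SELECT ts, avg(value) FROM sensors SAMPLE BY 1h")
    for row in cur.fetchall():
        print(row)
conn.close()
```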
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was six months, but some respondents could do it in less than a day. When the quantitative differences in data engineering between the best and the worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
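The speakers do not publish their framework here, but a minimal sketch of the end-to-end idea might look like the following: run the whole pipeline against tiny fixture inputs in a temporary directory and assert on the final output. The run_pipeline function is a hypothetical stand-in for invoking the real workflow orchestrator.

```python
# A hedged sketch (not the speaker's actual framework) of an end-to-end
# pipeline test: run every job in dependency order against small fixture
# inputs in a temp directory, then assert on the final output.
import json
import tempfile
from pathlib import Path

def run_pipeline(workdir: Path) -> None:
    """Stand-in for invoking the real orchestrator against `workdir`."""
    raw = json.loads((workdir / "raw.json").read_text())
    cleaned = [r for r in raw if r.get("user_id")]                   # job 1: clean
    (workdir / "report.json").write_text(json.dumps(len(cleaned)))   # job 2: aggregate

def test_pipeline_end_to_end():
    with tempfile.TemporaryDirectory() as d:
        workdir = Path(d)
        (workdir / "raw.json").write_text(
            json.dumps([{"user_id": 1}, {"user_id": None}])
        )
        run_pipeline(workdir)
        assert json.loads((workdir / "report.json").read_text()) == 1
```

The design point is that a downstream change is exercised against the entire chain of jobs, not a single unit, so a breaking upstream change is caught before deployment.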
State of Artificial intelligence Report 2023kuntobimo2016
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.
Codeless Generative AI Pipelines (GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
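As a hedged illustration of the delivery end of such a pipeline (in the talk this step is driven codelessly by Apache NiFi), here is a minimal pymilvus sketch that writes and searches embeddings in Milvus; the collection name, dimension, and vectors are hypothetical.

```python
# Write and search embeddings in Milvus with pymilvus.
# Collection name, dimension, and the toy vectors are hypothetical.
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")
client.create_collection(collection_name="docs", dimension=4)

# Insert one embedded document; `text` rides along as a dynamic field.
client.insert(
    collection_name="docs",
    data=[{"id": 1, "vector": [0.1, 0.2, 0.3, 0.4], "text": "hello genai"}],
)

# Nearest-neighbor search with the same toy vector.
hits = client.search(collection_name="docs", data=[[0.1, 0.2, 0.3, 0.4]], limit=1)
print(hits)
```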
3. A quick intro about Beaconstac
Beaconstac is a proximity marketing and analytics platform for beacons
Several beacon-specific events are defined to aid proximity marketing; the events include camp-on, beacon exit, region enter, region exit, etc.
The Beaconstac analytics platform makes it easy for managers/marketers/developers to analyze event data
Components include the Beaconstac iOS/Android SDK and the Beaconstac portal
4. Why Hadoop?
Collect event logs generated from Beaconstac SDK usage
Needed a system to answer queries like:
o Heat map of beacons by the number of visits received in a specified time interval
o Heat map of beacons by the amount of time spent in a specified time interval
o Average time spent by users near different beacons
o Last seen per user
o Last seen per beacon
o Analyzing data with custom attribute filters
o Traversed path in an area by individual users
5. Leveraging Amazon's EMR for Beaconstac Analytics
Amazon's Streaming API for writing mapper and reducer functions in Python (see the sketch below)
Input: copy programs to Amazon S3
Output: copy the processed/output data to S3
Initial tests were run using Amazon's EMR console, where you can define the following:
1) Cluster configuration: name, termination protection, logging, log location on S3, etc.
2) Software configuration: Hadoop AMI version, applications to be installed on startup, etc.
3) Hardware configuration: node types (master, core, and task)
4) Security keys and allowed users
5) Bootstrap actions: configure Hadoop, custom actions, etc.
6) Steps: streaming program, Hive program, Pig program
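To illustrate the streaming approach, here is a hedged mapper/reducer sketch in Python for one of the "Why Hadoop?" queries above, average time spent by users near different beacons; the field layout of the event logs is assumed, not taken from the deck.

```python
# mapper.py -- emit "beacon_id \t dwell_seconds" for each event log line.
# Assumed (hypothetical) input layout: beacon_id, user_id, dwell_seconds.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 3:
        print(f"{fields[0]}\t{fields[2]}")
```

```python
# reducer.py -- Hadoop Streaming sorts mapper output by key, so all values
# for one beacon arrive contiguously; accumulate and emit the average dwell.
import sys

current, total, count = None, 0.0, 0
for line in sys.stdin:
    beacon, dwell = line.rstrip("\n").split("\t")
    if beacon != current:
        if current is not None:
            print(f"{current}\t{total / count:.1f}")
        current, total, count = beacon, 0.0, 0
    total += float(dwell)
    count += 1
if current is not None:
    print(f"{current}\t{total / count:.1f}")
```

Because the framework sorts by key between the two stages, the reducer can compute per-beacon averages in a single pass over its input.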
9. How Does AWS Data Pipeline Work?
Pipeline definition: specifies the business logic of your data management
AWS Data Pipeline web service: interprets the pipeline definition and assigns tasks to workers to move and transform data
Task runner: polls the AWS Data Pipeline web service for tasks and then performs those tasks
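For illustration, here is a hedged sketch of driving these pieces from Python with boto3: create a pipeline, upload a definition, and activate it. The pipeline name, unique id, and definition fields are hypothetical.

```python
# Create, define, and activate an AWS Data Pipeline with boto3.
# Names, ids, and the minimal definition below are hypothetical.
import boto3

dp = boto3.client("datapipeline", region_name="us-east-1")

pipeline_id = dp.create_pipeline(
    name="beacon-analytics", uniqueId="beacon-analytics-v1"
)["pipelineId"]

# The pipeline definition is a list of objects, each a bag of key/value fields.
dp.put_pipeline_definition(
    pipelineId=pipeline_id,
    pipelineObjects=[
        {
            "id": "Default",
            "name": "Default",
            "fields": [{"key": "scheduleType", "stringValue": "ondemand"}],
        },
    ],
)
dp.activate_pipeline(pipelineId=pipeline_id)
```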
10. Morpheus version of Data pipeline
Copy logs from Kafka to S3: runs every hour; requires a Kafka consumer script
Run EMR jobs: runs once every day; processes each job and produces output; each job comprises mapper and reducer scripts
Copy the output to Elasticsearch: runs once every day; inserts the output into Elasticsearch
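A hedged sketch of the first stage, the Kafka consumer script that copies logs to S3, using kafka-python and boto3; the topic, bucket, and batch size are hypothetical.

```python
# Batch messages from Kafka and write each batch to S3.
# Topic, bucket, and the 1000-message batch size are hypothetical.
import time

import boto3
from kafka import KafkaConsumer

consumer = KafkaConsumer("event-logs", bootstrap_servers="localhost:9092")
s3 = boto3.client("s3")

batch = []
for msg in consumer:
    batch.append(msg.value)                   # raw bytes of each log record
    if len(batch) >= 1000:                    # flush every 1000 events
        key = f"logs/{int(time.time())}.log"
        s3.put_object(
            Bucket="beacon-event-logs",       # hypothetical bucket
            Key=key,
            Body=b"\n".join(batch),
        )
        batch = []
```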