Strata SF 2019 presentation about Presto's limitations in leveraging spot nodes, Qubole's features for using spot nodes reliably in Presto, and a case study on the efficacy of the solution.
Presto talk @ Global AI conference 2018 Boston - kbajda
Presented at Global AI Conference in Boston 2018:
http://www.globalbigdataconference.com/boston/global-artificial-intelligence-conference-106/speaker-details/kamil-bajda-pawlikowski-62952.html
Presto, an open source distributed SQL engine, is widely recognized for its low-latency queries, high concurrency, and native ability to query multiple data sources. Proven at scale in a variety of use cases at Facebook, Airbnb, Netflix, Uber, Twitter, LinkedIn, Bloomberg, and FINRA, Presto has experienced unprecedented growth in popularity in both on-premises and cloud deployments in the last few years. Presto is truly a SQL-on-Anything engine: a single query can access data from Hadoop, S3-compatible object stores, RDBMS, NoSQL and custom data stores. This talk will cover some of the best use cases for Presto, recent advancements in the project such as the Cost-Based Optimizer and Geospatial functions, and discuss the roadmap going forward.
Temporal Performance Modelling of Serverless Computing Platforms - WoSC6 - Nima Mahmoudi
This presentation is an overview of the "Temporal Performance Modeling of Serverless Computing Platforms" paper published in Sixth International Workshop on Serverless Computing (WoSC6) 2020 as part of IEEE Middleware conference.
Authors: Nima Mahmoudi and Hamzeh Khazaei
Paper: https://www.serverlesscomputing.org/wosc6/#p1
Preprint and Artifacts: https://research.nima-dev.com/publication/mahmoudi-2020-tempperf/
Full Presentation: https://youtu.be/9r3j_1B5t8c
Lightning Talk (1 min): https://youtu.be/E5KigIq0Z1E
PACS Lab: https://pacs.eecs.yorku.ca/
Scylla Summit 2022: IO Scheduling & NVMe Disk Modelling - ScyllaDB
Join ScyllaDB engineer Pavel Emelyanov, who will provide a walkthrough of Diskplorer, an open-source disk latency/bandwidth exploring toolset to measure behavior under load. Using Linux fio under the hood, Diskplorer runs a battery of measurements to discover performance characteristics for a specific hardware configuration, giving you an at-a-glance view of how server storage I/O will behave under load.
Discover how ScyllaDB uses this elaborated model of disk performance, as well as a scheduling algorithm developed for the Seastar framework to build latency-oriented I/O scheduling that cherry-picks requests from the incoming queue keeping the disk load perfectly balanced.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Lightweight Transactions at Lightning Speed - ScyllaDB
This talk will outline the Scylla implementation of Lightweight Transactions (LWT) that brings us to parity with Apache Cassandra. We will cover how to use it, what is working, and what is left to be done. We will also cover what other improvements are in store to improve Scylla's transactional capabilities and why it matters.
Iceberg: a modern table format for big data (Ryan Blue & Parth Brahmbhatt, Netflix)
Presto Summit 2018 (https://www.starburstdata.com/technical-blog/presto-summit-2018-recap/)
PGConf APAC 2018 - PostgreSQL performance comparison in various clouds - PGConf APAC
Speaker: Oskari Saarenmaa
Aiven PostgreSQL is available in five different public cloud providers' infrastructure in more than 60 regions around the world, including 18 in APAC. This has given us a unique opportunity to benchmark and compare performance of similar configurations in different environments.
We'll share our benchmark methods and results, comparing various PostgreSQL configurations and workloads across different clouds.
How Docker Accelerates Continuous Development at ironSource: Containers #101 ... - Brittany Ingram
Containers 101 meetup talk recording posted here: https://codefresh.io/blog/containers-101-meetup-docker-accelerates-continuous-development/
Shimon Tolts, General Manager/CTO of Data Solutions at ironSource, joined us to talk about how they leverage Docker to simplify their workflow and deliver Big Data solutions to their customers faster. He shared their experience running Docker containers in production and how they took one of their base systems, considered "the backbone of the company," and transformed it using containers.
Solr Power FTW: Powering NoSQL the World Over - Alex Pinkin
Solr is an open source, Lucene based search platform originally developed by CNET and used by the likes of Netflix, Yelp, and StubHub which has been rapidly growing in popularity and features during the last few years. Learn how Solr can be used as a Not Only SQL (NoSQL) database along the lines of Cassandra, Memcached, and Redis. NoSQL data stores are regularly described as non-relational, distributed, internet-scalable and are used at both Facebook and Digg. This presentation will quickly cover the fundamentals of NoSQL data stores, the basics of Lucene, and what Solr brings to the table. Following that we will dive into the technical details of making Solr your primary query engine on large scale web applications, thus relegating your traditional relational database to little more than a simple key store. Real solutions to problems like handling four billion requests per month will be presented. We'll talk about sizing and configuring the Solr instances to maintain rapid response times under heavy load. We'll show you how to change the schema on a live system with tens of millions of documents indexed while supporting real-time results. And finally, we'll answer your questions about ways to work around the lack of transactions in Solr and how you can do all of this in a highly available solution.
Since its inception, Scylla has offered a compelling alternative to Apache Cassandra, providing better performance for a lower cost of ownership.
With Scylla Open Source 4.0 we continue to extend our CQL interface features and capabilities and also now provide an open source alternative to DynamoDB, allowing you to run your workloads anywhere, on any cloud provider, or on premises.
Join ScyllaDB co-founders, CTO Avi Kivity and CEO Dor Laor, for a look at the new features in Scylla Open Source 4.0, and architectural and cost comparisons with the coming Cassandra 4.0.
Topics will include:
Improved consistency with our new Lightweight Transactions
Scylla Operator for Kubernetes
How we stack up against Apache Cassandra 4.0
Our “run anywhere” DynamoDB alternative
Monitoring NGINX (plus): key metrics and how-to - Datadog
NGINX just works and that's why we use it. That does not mean that it should be left unmonitored. As a web server, it plays a central role in a modern infrastructure. As a gatekeeper, it sees every interaction with the application. If you monitor it properly it can explain a lot about what is happening in the rest of your infrastructure.
In this talk you will learn more about NGINX (plus) metrics, what they mean and how to use them. You will also learn different methods (status, statsd, logs) to monitor NGINX with their pros and cons, illustrated with real data coming from real servers.
The Dark Side Of Go -- Go runtime related problems in TiDB in production - PingCAP
Ed Huang, CTO of PingCAP, talked at Go System Conference about dealing with the typical and profound issues related to Go’s runtime as your systems become more complex. Taking TiDB as an example, he demonstrated how these problems can be reproduced, located, and analyzed in production.
Scylla Summit 2022: New AWS Instances Perfect for ScyllaDB - ScyllaDB
In this talk AWS’ Ken Krupa, Head of Specialized Solutions Architecture, will describe the architecture and capabilities of two new AWS EC2 instance types perfect for data-intensive storage and IO-heavy workloads like ScyllaDB: the Intel-based I4i and the Graviton2-based I4g series.
The Intel Xeon Ice Lake-based I4i series provides unparalleled raw horsepower for your most demanding workloads. Meanwhile, the Graviton2-powered I4g instances provide lower cost per storage on a power-efficient platform to deploy your cloud-native applications.
Ken will also describe the AWS Nitro SSD, a new form of high-speed NVMe storage with a Flash Translation Layer built with Nitro controllers, which powers both of these instance families.
ScyllaDB VP of Product Tzach Livyatan will then share benchmarking results showing how ScyllaDB behaves under load on these two instance types, providing maximum system utility and efficiency.
To watch all of the recordings hosted during Scylla Summit 2022 visit our website here: https://www.scylladb.com/summit.
Order from chaos: automating monitoring configuration - Sensu Inc.
In a high-performance computing shop with over 3,000 nodes, Harvard FAS Research Computing can’t afford chaos around our monitoring checks! In this Sensu Summit 2019 talk, you'll hear from Harvard SRE Molly Duggan about how they’re using CI/CD pipelines and the Sensu Go API to ensure that all changes to their monitoring system are validated, reproducible, and version controlled.
Serverless Big Data Architecture on Google Cloud Platform at Credit OK - Kriangkrai Chaonithi
This is a talk at Barcamp Bangkhen 2018, presented by Kriangkrai Chaonithi.
I shared my experience at Credit OK building a data pipeline that ingests a huge amount of customer data into our big data analytics warehouse using serverless services on Google Cloud Platform.
As a result, we can handle our data at minimal cost without setting up any servers.
RubiX: A caching framework for big data engines in the cloud. Helps provide data caching capabilities to engines like Presto, Spark, Hadoop, etc transparently without user intervention.
Stream Processing Live Traffic Data with Kafka Streams - Tim Ysewyn
In this workshop we will set up a streaming framework which will process realtime data of traffic sensors installed within the Belgian road system.
Starting with the intake of the data, you will learn best practices and the recommended approach to split the information into events in a way that won’t come back to haunt you.
With some basic stream operations (count, filter, … ) you will get to know the data and experience how easy it is to get things done with Spring Boot & Spring Cloud Stream. But since simple data processing is not enough to fulfill all your streaming needs, we will also let you experience the power of windows.
After this workshop, tumbling, sliding and session windows hold no more mysteries and you will be a true streaming wizard.
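To make the idea of a tumbling window concrete, here is a minimal sketch in plain Python. It does not use Kafka Streams or Spring Cloud Stream; the function name, the `(timestamp, sensor_id)` event shape, and the sample values are assumptions made purely for illustration. Each event falls into exactly one fixed-size, non-overlapping window, and events are counted per window and sensor.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds):
    """Count events per (window_start, sensor_id) using fixed, non-overlapping windows.

    `events` is an iterable of (timestamp_in_seconds, sensor_id) pairs
    (a hypothetical event shape chosen for this sketch).
    """
    counts = defaultdict(int)
    for timestamp, sensor_id in events:
        # Align the timestamp to the start of the window that contains it.
        window_start = (timestamp // window_seconds) * window_seconds
        counts[(window_start, sensor_id)] += 1
    return dict(counts)


# Hypothetical sensor readings: (seconds since start, sensor id).
sample_events = [(0, "a"), (30, "a"), (60, "a"), (61, "b")]
per_window = tumbling_window_counts(sample_events, window_seconds=60)
```

Sliding and session windows differ in how an event maps to windows: a sliding window can place one event in several overlapping windows, while a session window closes after a gap of inactivity.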
Enabling Presto to handle massive scale at lightning speed - Shubham Tagra
Presto User Group Singapore Meetup - March 2019.
These slides walk through the current state of Presto, the features that help Presto work better in the cloud, and a glimpse of the roadmap.
Building Pinterest Real-Time Ads Platform Using Kafka Streams - confluent
Building Pinterest Real-Time Ads Platform Using Kafka Streams (Liquan Pei + Boyang Chen, Pinterest) Kafka Summit SF 2018
In this talk, we share our experience of building Pinterest’s real-time Ads Platform using Kafka Streams. The real-time budgeting system is the most mission-critical component of the Ads Platform, as it controls how each ad is delivered to maximize user, advertiser and Pinterest value. The system needs to handle over 50,000 queries per second (QPS) of impressions, requires less than five seconds of end-to-end latency, and must recover within five minutes during outages. It also needs to scale with the fast growth of Pinterest’s ads business.
The real-time budgeting system is composed of a real-time stream-stream joiner, a real-time spend aggregator and a spend predictor. At Pinterest’s scale, we need to overcome quite a few challenges to make each component work. For example, the stream-stream joiner needs to maintain terabyte-size state while supporting fast recovery, and the real-time spend aggregator needs to publish to thousands of ads servers while supporting over one million read QPS. We chose Kafka Streams because it provides millisecond-level latency, scalable event-based processing and easy-to-use APIs. In the process of building the system, we performed extensive tuning of RocksDB and the Kafka Producer and Consumer, and pushed several open source contributions to Apache Kafka. We are also working on adding a remote checkpoint for Kafka Streams state to reduce cold start time when adding more machines to the application. We believe that our experience can be beneficial to people who want to build real-time streaming solutions at large scale and deeply understand Kafka Streams.
HiveServer2 provides a multi-tenant service end-point for executing Hive queries concurrently. It provides support for authentication and authorization, serves as a JDBC endpoint for users to connect and run queries via various tools, maintains sessions and warm containers for faster query processing, provides caching at multiple levels and much more. In other words, it is an integral component of any Hive deployment. HiveServer2 deployments however often face performance and reliability issues leading to catastrophic failures at times. At Qubole, we have augmented HiveServer2 to utilize the capabilities of the cloud to offer an enterprise-ready scalable and stable HiveServer2 (or HS2) service.
The HS2 experience on the cloud at Qubole, which is our primary platform of deployment, has been enhanced to automatically scale based on the customer’s workload; our solution adds and gracefully removes HS2 instances according to the requirement, thus making HS2 service not only self-sufficient at scale but also fault-tolerant. We have implemented Load Balancing for queries based on the resource utilization on HS2 instances to provide a reliable, efficient and cost-effective solution. A health monitoring service, based on past learnings and insights of running HS2 in customer deployments, implemented on top of this scalable HS2 service acts as the foundation for battle-tested, enterprise-ready solution for HS2 instances. In this talk, we will share the details of such an implementation, and the challenges faced in providing an auto-scalable, highly performant and reliable HS2 experience in the cloud.
Topics include:
* Workload-aware autoscaling for HS2 clusters.
* Agent-based adaptive load balancing of Hive queries on multi-tenant HS2 clusters.
* Durability monitoring using failure semantics and automated measures to provide reliability.
* Enterprise level security for HS2 on the cloud.
* Metrics, monitoring and alerting around the HS2 service.
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ... - HostedbyConfluent
Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after the messages are produced to Kafka.
One challenge we faced was to update existing data in Pinot with the changelog in Kafka, and deliver an accurate view in the real-time analytical results. For example, the financial dashboard can report gross booking with the corrected Ride fares. And restaurant owners can analyze the UberEats orders with their latest delivery status.
Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We need to make architectural changes in how data is distributed via Kafka amongst the server nodes, how it's indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end and how we added this ability in Pinot without any performance degradation.
Adaptive Scaling of Microgateways on Kubernetes - WSO2
As businesses start increasingly relying on Kubernetes, the need to scale services based on the business demand becomes more important. While the traditional methods like scaling based on the CPU and memory are important, expressing different business metrics in CPU and memory isn’t always straightforward. In this light, auto-scaling based on custom metrics in Kubernetes is going to be immensely helpful.
With support for custom metrics, services can be scaled dynamically based on the request count or the error count of a particular service. This helps services respond smoothly to sudden bursts and traffic variations, ensuring business continuity, and allows resources to be allocated optimally among different services.
With its new release, the WSO2 Microgateway supports scaling based on custom metrics, enabling enterprises to scale the runtimes based on request count, error rate, requests in the pipeline, and more.
This slide deck will cover:
- The importance of selecting business-related metrics
- Custom metric support in WSO2 Microgateway
- A demo on auto-scaling WSO2 Microgateway based on request count
On-demand webinar: https://wso2.com/library/webinars/adaptive-scaling-of-microgateways-on-kubernetes/
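As a rough illustration of scaling on request count, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric value / target metric value). The function name and the sample numbers are assumptions for illustration only. The sketch omits the autoscaler's tolerance band, min/max replica bounds, and stabilization behavior, and it is not taken from the WSO2 Microgateway itself.

```python
import math


def desired_replicas(current_replicas, current_value, target_value):
    """Replica count suggested by the basic HPA ratio formula (simplified sketch)."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    # Scale the replica count by how far the observed metric is from its target.
    scaled = current_replicas * current_value / target_value
    # Never suggest fewer than one replica in this simplified version.
    return max(1, math.ceil(scaled))


# Hypothetical example: 4 replicas each seeing 150 requests/s against a target of 100.
suggested = desired_replicas(current_replicas=4, current_value=150, target_value=100)
```

Using a business-level metric such as requests per replica as `current_value` is what lets the same formula react to traffic bursts that CPU or memory readings might reflect only indirectly.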
Zero Downtime Architectures based on JEE platform. Almost every big enterprise with online business tries to design its applications in a way that they are always online. But is it also the case when we upgrade the database cluster? When we switch the whole data center? Based on a customer project we try to present common architecture principles that enable you to do all this without any service interruption and the most important: without any stress.
Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger - inside-BigData.com
In this presentation from the GPU Technology Conference, Wyatt Gorman from Google and Abhishek Gupta from Schlumberger present: Get Your Head in the Cloud - Lessons in GPU Computing with Schlumberger.
"Demand for GPUs in High Performance Computing is only growing, and it is costly and difficult to keep pace in an entirely on-premise environment. We will hear from Schlumberger on why and how they are utilizing cloud-based GPU-enabled computing resources from Google Cloud to supply their users with the computing power they need, from exploration and modeling to visualization."
Watch the video: https://wp.me/p3RLHQ-kcl
Learn more: https://www.blog.google/products/google-cloud/schlumberger-chooses-gcp-to-deliver-new-oil-and-gas-technology-platform/ and https://www.nvidia.com/en-us/gtc/
Things You MUST Know Before Deploying OpenStack: Bruno Lago, Catalyst IT - OpenStack
Audience: Advanced
About: Real world lessons and war stories about Catalyst IT’s experience in rolling out an OpenStack based public cloud in New Zealand.
This presentation will provide tips and advice that may save you a lot of time, money and nights of sleep if you are planning to run OpenStack in the future. It may also bring some insights to people that are already running OpenStack in production.
Topics covered will include: selection of hardware for optimal costs, techniques that drive quality and service levels up, common deployment mistakes, in place upgrades, how to identify the maturity level of each project and decide what is ready for production, and much more!
Speaker Bio: Bruno Lago – Entrepreneur, Catalyst IT Limited
Bruno Lago is a solutions architect that has been involved with the Catalyst Cloud (New Zealand’s first public cloud based on OpenStack) from its inception. He is passionate about open source software, cloud computing and disruptive technologies.
OpenStack Australia Day - Sydney 2016
https://events.aptira.com/openstack-australia-day-sydney-2016/
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder. Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Flink) and in-house technologies have helped Uber scale.
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020 - Mariano Gonzalez
Modernizing analytics data pipelines to gain the most of your data while optimizing costs can be challenging. However, today cloud providers offer a good set of services that can help with this endeavor. We will do a tour across some GCP services during this hands-on session, using DataFlow (apache beam) as the backbone to architect a modern analytics pipeline to wire them all together.
Presto User Group Singapore Meetup - March 2019.
This presentation covers Grab's deployment of Presto and walks you through Grab's journey with Presto in the cloud.
An Approach to Detecting Writing Styles Based on Clustering Techniques - ambekarshweta25
An Approach to Detecting Writing Styles Based on Clustering Techniques
Authors:
-Devkinandan Jagtap
-Shweta Ambekar
-Harshit Singh
-Nakul Sharma (Assistant Professor)
Institution:
VIIT Pune, India
Abstract:
This paper proposes a system to differentiate between human-generated and AI-generated texts using stylometric analysis. The system analyzes text files and classifies writing styles by employing various clustering algorithms, such as k-means, k-means++, hierarchical, and DBSCAN. The effectiveness of these algorithms is measured using silhouette scores. The system successfully identifies distinct writing styles within documents, demonstrating its potential for plagiarism detection.
Introduction:
Stylometry, the study of linguistic and structural features in texts, is used for tasks like plagiarism detection, genre separation, and author verification. This paper leverages stylometric analysis to identify different writing styles and improve plagiarism detection methods.
Methodology:
The system includes data collection, preprocessing, feature extraction, dimensional reduction, machine learning models for clustering, and performance comparison using silhouette scores. Feature extraction focuses on lexical features, vocabulary richness, and readability scores. The study uses a small dataset of texts from various authors and employs algorithms like k-means, k-means++, hierarchical clustering, and DBSCAN for clustering.
Results:
Experiments show that the system effectively identifies writing styles, with silhouette scores indicating reasonable to strong clustering when k=2. As the number of clusters increases, the silhouette scores decrease, indicating a drop in accuracy. K-means and k-means++ perform similarly, while hierarchical clustering is less optimized.
Conclusion and Future Work:
The system works well for distinguishing writing styles with two clusters but becomes less accurate as the number of clusters increases. Future research could focus on adding more parameters and optimizing the methodology to improve accuracy with higher cluster values. This system can enhance existing plagiarism detection tools, especially in academic settings.
Understanding Inductive Bias in Machine Learning - SUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions - Victor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
6th International Conference on Machine Learning & Applications (CMLA 2024) - ClaraZara1
The 6th International Conference on Machine Learning & Applications (CMLA 2024) will provide an excellent international forum for sharing knowledge and results in the theory, methodology and applications of Machine Learning.
HEAP SORT ILLUSTRATED WITH HEAPIFY, BUILD HEAP FOR DYNAMIC ARRAYS.
Heap sort is a comparison-based sorting technique based on Binary Heap data structure. It is similar to the selection sort where we first find the minimum element and place the minimum element at the beginning. Repeat the same process for the remaining elements.
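The heapify and build-heap steps named in the title can be sketched as follows. This is a minimal Python illustration, not the code from the slides; the function names are chosen here for clarity. Note that it uses the common in-place max-heap variant, which repeatedly moves the largest element to the end of the list, rather than the minimum-first description above. The result is the same ascending order.

```python
def heapify(items, size, root):
    """Sift items[root] down until the subtree rooted there is a max-heap."""
    while True:
        largest = root
        left, right = 2 * root + 1, 2 * root + 2
        if left < size and items[left] > items[largest]:
            largest = left
        if right < size and items[right] > items[largest]:
            largest = right
        if largest == root:
            return
        # Swap the larger child up and continue sifting down from its position.
        items[root], items[largest] = items[largest], items[root]
        root = largest


def build_heap(items):
    """Turn an arbitrary list into a max-heap, starting from the last parent node."""
    for root in range(len(items) // 2 - 1, -1, -1):
        heapify(items, len(items), root)


def heap_sort(items):
    """Sort the list in place in ascending order and return it."""
    build_heap(items)
    for end in range(len(items) - 1, 0, -1):
        # Move the current maximum to its final position, then restore the heap.
        items[0], items[end] = items[end], items[0]
        heapify(items, end, 0)
    return items


sorted_values = heap_sort([5, 1, 4, 2, 3])
```

Because Python lists are dynamic arrays, the same functions work unchanged as elements are appended; call `heap_sort` again after the list grows.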
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines - Christina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
3. Built for Anyone who Uses Data
Analysts | Data Scientists | Data Engineers | Data Admins
Optimize performance, cost, and scale through automation, control and orchestration of big data workloads.
A Single Platform for Any Use Case
ETL & Reporting | Ad Hoc Queries | Machine Learning | Streaming | Vertical Apps
Open Source Engines, Optimized for the Cloud
Native Integration with multiple cloud providers