Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019

Strata SF 2019 presentation about Presto's limitations in leveraging spot nodes, Qubole's features for reliably using spot nodes in Presto, and a case study on the efficacy of the solution.

00Copyright 2017 © Qubole
Cost Effective Presto on AWS
with Spot Nodes
Strata Data Conference
March 2019
Shubham Tagra (stagra@qubole.com)

Agenda
● Introduction to Presto and Spot Nodes
● Problems with Presto on Spot Nodes
● Qubole's journey of Presto on Spot Nodes
● Case study

Built for Anyone who Uses Data
Analysts | Data Scientists | Data Engineers | Data Admins
Optimize performance, cost, and scale through automation, control and orchestration of big data workloads.
A Single Platform for Any Use Case
ETL & Reporting | Ad Hoc Queries | Machine Learning | Streaming | Vertical Apps
Open Source Engines, Optimized for the Cloud
Native Integration with multiple cloud providers

Presto
● High Performance SQL Query Engine for Big Data
● In-memory and pipelined execution model
● Well suited for ad hoc and reporting use cases
● Gained huge popularity in recent years
● More than 400% growth rate at Qubole in 2018

Spot Nodes
● Surplus compute at AWS
● Available at highly discounted price
● Can be interrupted
● Well suited for stateless services

Presto + Spot Nodes?
● High performance + lower cost
● Obvious match? NO
● Presto is fault-intolerant
● Good chance of a Spot Node being interrupted
● Query failures around the interruption

Reasons for Presto's fault intolerance
● In-Memory execution
● Aggressive pipelining

Qubole's journey of Presto on Spot Nodes
● Maximum Spot Percentage
■ Management system for clusters with a mix of node types
■ Spot percentage in autoscaled nodes
■ Stable core size
■ Appropriate fallbacks
■ Spot Rebalancer
■ More reliable than a 100% spot cluster
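
The slide gives no implementation details, so the following is only a minimal sketch of how a maximum spot percentage over autoscaled nodes could be applied. The `ClusterPolicy` class, the function names, and the choice to keep the stable core entirely on-demand are assumptions made for illustration, not Qubole's actual logic.

```python
from dataclasses import dataclass


@dataclass
class ClusterPolicy:
    min_nodes: int         # stable core size (assumed on-demand in this sketch)
    max_nodes: int         # upper bound for autoscaling
    max_spot_percent: int  # cap on the spot share of the autoscaled nodes


def plan_node_mix(policy: ClusterPolicy, desired_nodes: int) -> dict:
    """Split a desired cluster size into on-demand and spot node counts."""
    # Clamp the request to the configured cluster bounds.
    total = max(policy.min_nodes, min(desired_nodes, policy.max_nodes))
    autoscaled = total - policy.min_nodes
    # Integer division rounds down, so the spot share never exceeds the cap.
    spot = autoscaled * policy.max_spot_percent // 100
    return {
        "core_on_demand": policy.min_nodes,
        "autoscaled_on_demand": autoscaled - spot,
        "autoscaled_spot": spot,
    }


def apply_fallback(plan: dict, spot_granted: int) -> dict:
    """Hypothetical fallback: fill any spot shortfall with on-demand nodes."""
    granted = max(0, min(spot_granted, plan["autoscaled_spot"]))
    shortfall = plan["autoscaled_spot"] - granted
    return {
        "core_on_demand": plan["core_on_demand"],
        "autoscaled_on_demand": plan["autoscaled_on_demand"] + shortfall,
        "autoscaled_spot": granted,
    }
```

With a policy of min=15, max=25 and a 100% cap, a request for 22 nodes yields 15 core nodes and 7 spot nodes. If only 4 spot nodes can be obtained, the fallback fills the remaining 3 with on-demand nodes, which a rebalancer could later swap back to spot.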

Qubole's journey of Presto on Spot Nodes
● Heterogeneous Clusters
■ Multiple Node Types
■ Multiple AZ support

Qubole's journey of Presto on Spot Nodes
● Spot Termination Notification (STN)
■ 2-minute notice of interruption
■ STN Handling
■ Node Blacklisting
■ New node bring-up
■ Spot rebalancer
■ Solves for short queries
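
As a rough illustration of STN handling, the sketch below reads the interruption notice that EC2 exposes through the instance metadata endpoint `spot/instance-action` (which returns HTTP 404 until a notice is issued) and computes how much time is left. The handler `on_notice`, the `blacklist` set, and the node identifier are hypothetical names, and the fetch assumes the metadata service is reachable without a session token. With IMDSv2 enforced, a token request would be needed first.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional

INSTANCE_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"


def fetch_instance_action(timeout: float = 1.0) -> Optional[str]:
    """Return the raw notice body, or None when no interruption is scheduled."""
    try:
        with urllib.request.urlopen(INSTANCE_ACTION_URL, timeout=timeout) as resp:
            return resp.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption notice has been issued yet
            return None
        raise


def seconds_until_interruption(body: str, now: datetime) -> float:
    """Parse a notice such as {"action": "terminate", "time": "...Z"}."""
    notice = json.loads(body)
    # The timestamp is UTC in ISO 8601 form with a trailing "Z".
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()


def on_notice(body: str, now: datetime, blacklist: set, node_id: str) -> float:
    """Hypothetical handler: stop scheduling on the node, report time left."""
    blacklist.add(node_id)
    return seconds_until_interruption(body, now)
```

A worker-side agent could poll `fetch_instance_action` every few seconds and call `on_notice` once it returns a body. Queries that finish within the remaining time are unaffected, which matches the "solves for short queries" point. Longer queries still need a retry mechanism.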

Qubole's journey of Presto on Spot Nodes
● Smart Query Retry
■ Cluster-aware Retry System
■ Provided through the Presto Server
● Smart Query Retry Requirements
■ Avoid unnecessary retries
■ Wait for the rollback of the failed query
■ Should not retry if partial results were returned
■ Retry should be transparent to Presto clients
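
The requirements above can be read as a small decision procedure. The sketch below is one possible encoding, not the actual Presto or Qubole implementation. The `FailedQuery` fields, the `max_attempts` limit, and the three outcome strings are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailedQuery:
    failed_on_lost_node: bool  # failure traced to an interrupted node
    rollback_complete: bool    # cleanup of the failed attempt has finished
    rows_sent_to_client: int   # partial results already returned
    attempts: int              # retries used so far


def retry_decision(q: FailedQuery, max_attempts: int = 3) -> str:
    """Return "retry", "wait", or "fail" for a failed query."""
    if not q.failed_on_lost_node:
        # For example, a syntax error would fail again: avoid the retry.
        return "fail"
    if q.rows_sent_to_client > 0:
        # The client has seen partial output; a rerun could duplicate it.
        return "fail"
    if q.attempts >= max_attempts:
        return "fail"
    if not q.rollback_complete:
        # Retrying before rollback finishes could collide with leftover writes.
        return "wait"
    return "retry"
```

Running this check inside the server rather than in each client is what would keep the retry transparent: the client keeps polling the same query and only ever sees the final outcome.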

Case Study
● A service made up of several Presto clusters
● Each cluster configured with min=15 nodes, max=25 nodes, and up to 100% spot nodes among the autoscaled nodes
● Metrics collected on queries, nodes, EC2 events, query retries, etc.
● 2-month sample period

Case Study - Smart Query Retry Impact

Case Study - Costs

Conclusion
● Spot nodes save considerable costs
● AWS provides some features to minimize impact of spot interruptions
● Smart Retries are needed to reliably leverage spot in Presto

Thanks, and please rate today's session
(session page on the conference website or the O'Reilly Events App)





Nuclear Power Economics and Structuring 2024

Massimo Talia

Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf

aqil azizi

Harnessing WebAssembly for Real-time Stateless Streaming Pipelines

Christina Lin

Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.

Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf

WENKENLI1

Recently uploaded (20)

An Approach to Detecting Writing Styles Based on Clustering Techniques

Understanding Inductive Bias in Machine Learning

basic-wireline-operations-course-mahmoud-f-radwan.pdf

在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样

Modelagem de um CSTR com reação endotermica.pdf

Design and Analysis of Algorithms-DP,Backtracking,Graphs,B&B

Planning Of Procurement o different goods and services

KuberTENes Birthday Bash Guadalajara - K8sGPT first impressions

6th International Conference on Machine Learning & Applications (CMLA 2024)

一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理

一比一原版(SFU毕业证)西蒙菲莎大学毕业证成绩单如何办理

digital fundamental by Thomas L.floydl.pdf

Swimming pool mechanical components design.pptx

Heap Sort (SS).ppt FOR ENGINEERING GRADUATES, BCA, MCA, MTECH, BSC STUDENTS

14 Template Contractual Notice - EOT Application

一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单专业办理

Nuclear Power Economics and Structuring 2024

Tutorial for 16S rRNA Gene Analysis with QIIME2.pdf

Harnessing WebAssembly for Real-time Stateless Streaming Pipelines

Governing Equations for Fundamental Aerodynamics_Anderson2010.pdf

Cost Effective Presto on AWS with Spot Nodes - Strata SF 2019

3. Built for Anyone who Uses Data
Analysts | Data Scientists | Data Engineers | Data Admins
Optimize performance, cost, and scale through automation, control and orchestration of big data workloads.
A Single Platform for Any Use Case
ETL & Reporting | Ad Hoc Queries | Machine Learning | Streaming | Vertical Apps
Open Source Engines, Optimized for the Cloud
Native Integration with multiple cloud providers

4. Presto
● High Performance SQL Query Engine for Big Data
● In-memory and pipelined execution model
● Well suited for ad hoc and reporting use cases
● Gained huge popularity in recent years
● More than 400% growth rate in Qubole in 2018

6. Presto + Spot Nodes?
● High Performance + lower cost
● Obvious match? NO
● Presto is fault intolerant
● Good chances of Spot Node being interrupted
● Query failures around the interruption

9. Qubole's journey of Presto on Spot Nodes
● Maximum Spot Percentage
■ Management System for cluster with a mix of node types
■ Spot percentage in autoscaled nodes
■ Stable core size
■ Appropriate Fallbacks
■ Spot Rebalancer
■ More reliable than 100% spot cluster

11. Qubole's journey of Presto on Spot Nodes
● Spot Termination Notification (STN)
■ 2 minutes notice of interruption
■ STN Handling
■ Node Blacklisting
■ New node bringup
■ Spot rebalancer
■ Solves for short queries
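
The STN handling steps above (blacklist the node, bring up a replacement) can be sketched roughly as follows. This is a minimal illustration and not Qubole's implementation: the `Cluster` class, `handle_stn`, and `seconds_until_interruption` are hypothetical names, and the payload is assumed to follow the JSON shape that EC2 instance metadata exposes for a spot instance action (an `action` and a UTC `time`).

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Cluster:
    """Hypothetical in-memory model of a cluster (illustration only)."""
    active: set = field(default_factory=set)
    blacklisted: set = field(default_factory=set)
    pending_replacements: int = 0


def seconds_until_interruption(payload: str, now: datetime) -> float:
    """Parse an instance-action document and return the remaining notice in seconds."""
    doc = json.loads(payload)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ")
    return (when.replace(tzinfo=timezone.utc) - now).total_seconds()


def handle_stn(cluster: Cluster, node_id: str) -> None:
    """On a termination notice: stop scheduling on the node and request a replacement."""
    if node_id not in cluster.active:
        return  # unknown node, or the notice was already handled
    cluster.active.remove(node_id)
    cluster.blacklisted.add(node_id)    # no new work is assigned to this node
    cluster.pending_replacements += 1   # start bringing up a new node right away


# Assumed example payload and clock, chosen to give a 2-minute notice.
payload = '{"action": "terminate", "time": "2019-03-27T10:02:00Z"}'
now = datetime(2019, 3, 27, 10, 0, 0, tzinfo=timezone.utc)
remaining = seconds_until_interruption(payload, now)

cluster = Cluster(active={"node-1", "node-2"})
handle_stn(cluster, "node-2")
```

Queries that finish within the remaining notice are unaffected once the node stops receiving new work, which is consistent with the slide's note that STN handling solves for short queries.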

12. Qubole's journey of Presto on Spot Nodes
● Smart Query Retry
■ Cluster-aware Retry System
■ Provided through the Presto Server
● Smart Query Retry Requirements
■ Avoid unnecessary retries
■ Wait for the rollback of failed query
■ Should not retry if partial results were returned
■ Retry should be transparent to Presto clients
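
One way to read the retry requirements above is as a small decision function. The sketch below is an assumption-laden illustration, not the logic inside Qubole's Presto server: `FailedQuery`, its fields, `retry_decision`, and the `max_attempts` cap are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailedQuery:
    """Hypothetical summary of a failed query; field names are illustrative."""
    failed_due_to_node_loss: bool   # e.g. a worker disappeared mid-query
    partial_results_returned: bool  # the client has already received some rows
    rollback_complete: bool         # the failed attempt has been rolled back
    attempts: int                   # attempts made so far


def retry_decision(q: FailedQuery, max_attempts: int = 3) -> str:
    """Return 'fail', 'wait', or 'retry' for a failed query."""
    if not q.failed_due_to_node_loss:
        return "fail"   # avoid unnecessary retries, e.g. for user errors
    if q.partial_results_returned:
        return "fail"   # a retry could hand duplicate rows to the client
    if q.attempts >= max_attempts:
        return "fail"   # assumed cap; the slides do not state a limit
    if not q.rollback_complete:
        return "wait"   # retry only after the failed attempt is rolled back
    return "retry"
```

Making the decision on the server side is what would let the retry stay transparent to Presto clients: in this reading, a client only sees a failure when the function returns `fail`.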

13. Case Study
● A service of several Presto clusters
● Each cluster configured with min=15, max=25 and max 100% spot nodes in autoscaled nodes
● Metrics collected around queries, nodes, EC2 events, query retries, etc.
● 2-month sample period
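
To make the cluster configuration above concrete, the following sketch works out how many spot nodes such a cluster can hold. It assumes that the minimum-size nodes form the stable, non-spot core and that the spot percentage applies only to nodes added by autoscaling; the function name and parameters are hypothetical.

```python
def max_spot_nodes(min_nodes: int, max_nodes: int, spot_pct: int) -> int:
    """Upper bound on spot nodes when spot_pct applies only to autoscaled nodes."""
    autoscaled = max_nodes - min_nodes  # nodes beyond the stable core
    return autoscaled * spot_pct // 100


# Case-study settings: min=15, max=25, up to 100% spot among autoscaled nodes.
spot = max_spot_nodes(15, 25, 100)
overall_spot_pct = 100 * spot // 25  # spot share of the whole cluster at max size
```

Under these assumptions, at most 10 of the 25 nodes are spot, so even at full scale-out spot nodes make up 40% of the cluster.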

16. Conclusion
● Spot nodes save considerable costs
● AWS provides some features to minimize impact of spot interruptions
● Smart Retries are needed to reliably leverage spot in Presto

17. Thanks, and please rate today's session
● Session page on conference website
● O'Reilly Events App
