This short presentation shows one of ways to to integrate Cerberus and PySpark. It was initially given at Paris.py meetup (https://www.meetup.com/Paris-py-Python-Django-friends/events/264404036/)
Beyond SQL: Speeding up Spark with DataFramesDatabricks
This document summarizes Spark SQL and DataFrames in Spark. It notes that Spark SQL is part of the core Spark distribution and allows running SQL and HiveQL queries. DataFrames provide a way to select, filter, aggregate and plot structured data like in R and Pandas. DataFrames allow writing less code through a high-level API and reading less data by using optimized formats and partitioning. The optimizer can optimize queries across functions and push down predicates to read less data. This allows creating and running Spark programs faster.
Tutorial on developing a Solr search component pluginsearchbox-com
In this set of slides we give a step by step tutorial on how to develop a fully functional solr search component plugin. Additionally we provide links to full source code which can be used as a template to rapidly start creating your own search components.
Machine learning is overhyped nowadays. There is a strong belief that this area is exclusively for data scientists with a deep mathematical background that leverage Python (scikit-learn, Theano, Tensorflow, etc.) or R ecosystem and use specific tools like Matlab, Octave or similar. Of course, there is a big grain of truth in this statement, but we, Java engineers, also can take the best of machine learning universe from an applied perspective by using our native language and familiar frameworks like Apache Spark. During this introductory presentation, you will get acquainted with the simplest machine learning tasks and algorithms, like regression, classification, clustering, widen your outlook and use Apache Spark MLlib to distinguish pop music from heavy metal and simply have fun.
Source code: https://github.com/tmatyashovsky/spark-ml-samples
Design by Yarko Filevych: http://filevych.com/
Azure Cosmos DB is Microsoft's globally distributed, multi-model database service that supports multiple APIs such as SQL, Cassandra, MongoDB, Gremlin and Azure Table. It allows storing entities with automatic partitioning and provides automatic online backups every 4 hours with the latest 2 backups stored. The Azure Cosmos DB change feed and Data Migration Tool allow importing and exporting data for backups. An emulator is also available for trying Cosmos DB locally without an Azure account.
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...Edureka!
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Programming will give you a complete insight of the various fundamental concepts of PySpark. Fundamental concepts include the following:
1. PySpark
2. RDDs
3. DataFrames
4. PySpark SQL
5. PySpark Streaming
6. Machine Learning (MLlib)
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
This document discusses Oracle's machine learning capabilities. It provides an overview of the types of machine learning algorithms available in Oracle such as classification, clustering, regression, and time series analysis. It also describes how machine learning can be used directly from SQL and integrated with Oracle Autonomous Database and Oracle Database to build and deploy models. New algorithms like XGBoost and features for enhanced prediction are highlighted.
This document discusses installing and running Oracle GoldenGate 19c on Docker. It provides an overview of Oracle GoldenGate, describes how to build an Oracle GoldenGate image on Docker, and outlines two options for launching an Oracle GoldenGate container - using a provided script or a manual build. Key steps include downloading the Oracle GoldenGate installation files, building the Docker image using the files, and running the GoldenGate container to enable data replication functionality on Docker.
Beyond SQL: Speeding up Spark with DataFramesDatabricks
This document summarizes Spark SQL and DataFrames in Spark. It notes that Spark SQL is part of the core Spark distribution and allows running SQL and HiveQL queries. DataFrames provide a way to select, filter, aggregate and plot structured data like in R and Pandas. DataFrames allow writing less code through a high-level API and reading less data by using optimized formats and partitioning. The optimizer can optimize queries across functions and push down predicates to read less data. This allows creating and running Spark programs faster.
Tutorial on developing a Solr search component pluginsearchbox-com
In this set of slides we give a step by step tutorial on how to develop a fully functional solr search component plugin. Additionally we provide links to full source code which can be used as a template to rapidly start creating your own search components.
Machine learning is overhyped nowadays. There is a strong belief that this area is exclusively for data scientists with a deep mathematical background that leverage Python (scikit-learn, Theano, Tensorflow, etc.) or R ecosystem and use specific tools like Matlab, Octave or similar. Of course, there is a big grain of truth in this statement, but we, Java engineers, also can take the best of machine learning universe from an applied perspective by using our native language and familiar frameworks like Apache Spark. During this introductory presentation, you will get acquainted with the simplest machine learning tasks and algorithms, like regression, classification, clustering, widen your outlook and use Apache Spark MLlib to distinguish pop music from heavy metal and simply have fun.
Source code: https://github.com/tmatyashovsky/spark-ml-samples
Design by Yarko Filevych: http://filevych.com/
Azure Cosmos DB is Microsoft's globally distributed, multi-model database service that supports multiple APIs such as SQL, Cassandra, MongoDB, Gremlin and Azure Table. It allows storing entities with automatic partitioning and provides automatic online backups every 4 hours with the latest 2 backups stored. The Azure Cosmos DB change feed and Data Migration Tool allow importing and exporting data for backups. An emulator is also available for trying Cosmos DB locally without an Azure account.
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu...Edureka!
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Programming will give you a complete insight of the various fundamental concepts of PySpark. Fundamental concepts include the following:
1. PySpark
2. RDDs
3. DataFrames
4. PySpark SQL
5. PySpark Streaming
6. Machine Learning (MLlib)
Spark Summit EU 2015: Lessons from 300+ production usersDatabricks
At Databricks, we have a unique view into over a hundred different companies trying out Spark for development and production use-cases, from their support tickets and forum posts. Having seen so many different workflows and applications, some discernible patterns emerge when looking at common performance and scalability issues that our users run into. This talk will discuss some of these common common issues from an engineering and operations perspective, describing solutions and clarifying misconceptions.
This document discusses Oracle's machine learning capabilities. It provides an overview of the types of machine learning algorithms available in Oracle such as classification, clustering, regression, and time series analysis. It also describes how machine learning can be used directly from SQL and integrated with Oracle Autonomous Database and Oracle Database to build and deploy models. New algorithms like XGBoost and features for enhanced prediction are highlighted.
This document discusses installing and running Oracle GoldenGate 19c on Docker. It provides an overview of Oracle GoldenGate, describes how to build an Oracle GoldenGate image on Docker, and outlines two options for launching an Oracle GoldenGate container - using a provided script or a manual build. Key steps include downloading the Oracle GoldenGate installation files, building the Docker image using the files, and running the GoldenGate container to enable data replication functionality on Docker.
How to Analyze and Tune MySQL Queries for Better Performanceoysteing
The document discusses how to analyze and tune queries for better performance in MySQL. It covers topics like cost-based query optimization in MySQL, tools for monitoring, analyzing and tuning queries, data access and index selection, the join optimizer, subqueries, sorting, and influencing the optimizer. The program agenda outlines these topics and their order.
Top 65 SQL Interview Questions and Answers | EdurekaEdureka!
** MYSQL DBA Certification Training https://www.edureka.co/mysql-dba **
This Edureka PPT on Top 65 SQL Interview Question and Answers will help you to prepare yourself for Database Administrators Interviews. It covers questions for beginners, intermediate and experienced professionals.
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
Learning Objectives:
• Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing.
• How to deploy and tune scalable clusters running Spark on Amazon EMR.
• How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3.
• Common architectures to leverage Spark with Amazon DynamoDB, Amazon Redshift, Amazon Kinesis, and more.
The document compares the query execution plans produced by Apache Hive and PostgreSQL. It shows that Hive's old-style execution plans are overly verbose and difficult to understand, providing many low-level details across multiple stages. In contrast, PostgreSQL's plans are more concise and readable, showing the logical query plan in a top-down manner with actual table names and fewer lines of text. The document advocates for Hive to adopt a simpler execution plan format similar to PostgreSQL's.
Apache Spark™ is a fast and general engine for large-scale data processing. Spark is written in Scala and runs on top of JVM, but Python is one of the officially supported languages. But how does it actually work? How can Python communicate with Java / Scala? In this talk, we’ll dive into the PySpark internals and try to understand how to write and test high-performance PySpark applications.
New Features for Multitenant in Oracle Database 21cMarkus Flechtner
Oracle Database 21c introduces several new features for multitenant databases:
- PDBs can now be upgraded automatically when plugged into a 21c CDB or opened, replaying the upgrade process.
- Resource management is improved with options like mandatory user profiles, per-PDB database resident connection pooling, and Oracle DB Nest for isolating PDBs using Linux namespaces and cgroups.
- Multitenant enhancements for high availability include PDBs being managed as cluster resources and improved PDB-level recovery when using Active Data Guard.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
Slides for Data Syndrome one hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017Amazon Web Services
Attend this session for a technical deep dive about RDS Postgres and Aurora Postgres. Come hear from Mark Porter, the General Manager of Aurora PostgreSQL and RDS at AWS, as he covers service specific use cases and applications within the AWS worldwide public sector community. Learn More: https://aws.amazon.com/government-education/
PL/SQL cursors allow naming and manipulating the result set of a SQL query. There are two types of cursors: implicit and explicit. Implicit cursors are associated with DML statements and queries with INTO clauses, while explicit cursors must be declared, opened, fetched from, and closed. Explicit cursors can be static, using a fixed SQL query, or dynamic, changing the SQL at runtime. Cursors support attributes like %FOUND and %ROWCOUNT and can iterate over query results using a FOR loop.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
Presented the "A Cloud Journey - Move to the Oracle Cloud" on behalf of Ricardo Gonzalez during Bulgarian Oracle User Group Spring Conference 2019. This presentation discusses various methods on how to migrate to the Oracle Cloud and provides recommendations as to which tool to use (and where to find it) especially assuming that Zero Downtime Migration is desired, for which the new Zero Downtime Migration tool is described and discussed in detail. More information: http://www.oracle.com/goto/move
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB
“Why is MongoDB so slow?” you may ask yourself on occasion. You’ve created indexes, you’ve learned how to use the aggregation pipeline. What the heck? Could it be your queries? This talk will outline what tools are at your disposal (both in MongoDB Atlas and in MongoDB server) to identify inefficient queries.
This document discusses Presto, an interactive SQL query engine for big data. It describes how Presto is optimized to quickly query data stored in Parquet format at Uber. Key optimizations for Parquet include nested column pruning, columnar reads, predicate pushdown, dictionary pushdown, and lazy reads. Benchmark results show these optimizations improve Presto query performance. The document also provides an overview of Uber's analytics infrastructure, applications of Presto, and ongoing work to further optimize Presto and Hadoop.
The document outlines an Oracle Database Administration competency roadmap. It lists various certifications, technologies, and skills grouped under headings like Tuning and Backup/Recovery, Programming, Server, Security/Methodology. The roadmap provides a path for administrators to develop expertise in areas like database installation, configuration, performance tuning, programming, security, and emerging technologies.
Applying DevOps to Databricks can be a daunting task. In this talk this will be broken down into bite size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IAC (Infrastructure as Code) and Build Agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, will will focus on the Databricks Rest API (using Python) to perform our tasks.
Dynamic Partition Pruning in Apache SparkDatabricks
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.
This was a short introduction to Scala programming language.
me and my colleague lectured these slides in Programming Language Design and Implementation course in K.N. Toosi University of Technology.
Smart Join Algorithms for Fighting Skew at ScaleDatabricks
Consumer apps like Yelp generate log data at huge scale, and often this is distributed according to a power law, where a small number of users, businesses, locations, or pages are associated with a disproportionately large amount of data. This kind of data skew can cause problems for distributed algorithms, especially joins, where all the rows with the same key must be processed by the same executor. Even just a single over-represented entity can cause a whole job to slow down or fail. One approach to this problem is to remove outliers before joining, and this might be fine when training a machine learning model, but sometimes you need to retain all the data. Thankfully, there are a few tricks you can use to counteract the negative effects of skew while joining, by artificially redistributing data across more machines. This talk will walk through some of them, with code examples.
AST - the only true tool for building JavaScriptIngvar Stepanyan
The document discusses working with code abstract syntax trees (ASTs). It provides examples of parsing code into ASTs using libraries like Esprima, querying ASTs using libraries like grasp-equery, constructing and transforming ASTs, and generating code from ASTs. It introduces aster, an AST-based code builder that allows defining reusable AST transformations as plugins and integrating AST-based builds into generic build systems like Grunt and Gulp. Aster aims to improve on file-based builders by working directly with ASTs in a streaming fashion.
Building an api using golang and postgre sql v1.0Frost
This document provides instructions for building a REST API using Golang and PostgreSQL. It discusses setting up the PostgreSQL database, defining a data structure in Golang, and implementing CRUD operations through API endpoints. Key steps include connecting to the database, querying and executing SQL statements to get, add, update and delete record data, and encoding responses as JSON. The API routes and handlers are defined, CORS is enabled, and the server is started to execute the API.
How to Analyze and Tune MySQL Queries for Better Performanceoysteing
The document discusses how to analyze and tune queries for better performance in MySQL. It covers topics like cost-based query optimization in MySQL, tools for monitoring, analyzing and tuning queries, data access and index selection, the join optimizer, subqueries, sorting, and influencing the optimizer. The program agenda outlines these topics and their order.
Top 65 SQL Interview Questions and Answers | EdurekaEdureka!
** MYSQL DBA Certification Training https://www.edureka.co/mysql-dba **
This Edureka PPT on Top 65 SQL Interview Question and Answers will help you to prepare yourself for Database Administrators Interviews. It covers questions for beginners, intermediate and experienced professionals.
Follow us to never miss an update in the future.
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Organizations need to perform increasingly complex analysis on data — streaming analytics, ad-hoc querying, and predictive analytics — in order to get better customer insights and actionable business intelligence. Apache Spark has recently emerged as the framework of choice to address many of these challenges. In this session, we show you how to use Apache Spark on AWS to implement and scale common big data use cases such as real-time data processing, interactive data science, predictive analytics, and more. We will talk about common architectures, best practices to quickly create Spark clusters using Amazon EMR, and ways to integrate Spark with other big data services in AWS.
Learning Objectives:
• Learn why Spark is great for ad-hoc interactive analysis and real-time stream processing.
• How to deploy and tune scalable clusters running Spark on Amazon EMR.
• How to use EMR File System (EMRFS) with Spark to query data directly in Amazon S3.
• Common architectures to leverage Spark with Amazon DynamoDB, Amazon Redshift, Amazon Kinesis, and more.
The document compares the query execution plans produced by Apache Hive and PostgreSQL. It shows that Hive's old-style execution plans are overly verbose and difficult to understand, providing many low-level details across multiple stages. In contrast, PostgreSQL's plans are more concise and readable, showing the logical query plan in a top-down manner with actual table names and fewer lines of text. The document advocates for Hive to adopt a simpler execution plan format similar to PostgreSQL's.
Apache Spark™ is a fast and general engine for large-scale data processing. Spark is written in Scala and runs on top of JVM, but Python is one of the officially supported languages. But how does it actually work? How can Python communicate with Java / Scala? In this talk, we’ll dive into the PySpark internals and try to understand how to write and test high-performance PySpark applications.
New Features for Multitenant in Oracle Database 21cMarkus Flechtner
Oracle Database 21c introduces several new features for multitenant databases:
- PDBs can now be upgraded automatically when plugged into a 21c CDB or opened, replaying the upgrade process.
- Resource management is improved with options like mandatory user profiles, per-PDB database resident connection pooling, and Oracle DB Nest for isolating PDBs using Linux namespaces and cgroups.
- Multitenant enhancements for high availability include PDBs being managed as cluster resources and improved PDB-level recovery when using Active Data Guard.
"The common use cases of Spark SQL include ad hoc analysis, logical warehouse, query federation, and ETL processing. Spark SQL also powers the other Spark libraries, including structured streaming for stream processing, MLlib for machine learning, and GraphFrame for graph-parallel computation. For boosting the speed of your Spark applications, you can perform the optimization efforts on the queries prior employing to the production systems. Spark query plans and Spark UIs provide you insight on the performance of your queries. This talk discloses how to read and tune the query plans for enhanced performance. It will also cover the major related features in the recent and upcoming releases of Apache Spark.
"
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
This document summarizes a presentation on Spark SQL and its capabilities. Spark SQL allows users to run SQL queries on Spark, including HiveQL queries with UDFs, UDAFs, and SerDes. It provides a unified interface for reading and writing data in various formats. Spark SQL also allows users to express common operations like selecting columns, joining data, and aggregation concisely through its DataFrame API. This reduces the amount of code users need to write compared to lower-level APIs like RDDs.
Slides for Data Syndrome one hour course on PySpark. Introduces basic operations, Spark SQL, Spark MLlib and exploratory data analysis with PySpark. Shows how to use pylab with Spark to create histograms.
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017Amazon Web Services
Attend this session for a technical deep dive about RDS Postgres and Aurora Postgres. Come hear from Mark Porter, the General Manager of Aurora PostgreSQL and RDS at AWS, as he covers service specific use cases and applications within the AWS worldwide public sector community. Learn More: https://aws.amazon.com/government-education/
PL/SQL cursors allow naming and manipulating the result set of a SQL query. There are two types of cursors: implicit and explicit. Implicit cursors are associated with DML statements and queries with INTO clauses, while explicit cursors must be declared, opened, fetched from, and closed. Explicit cursors can be static, using a fixed SQL query, or dynamic, changing the SQL at runtime. Cursors support attributes like %FOUND and %ROWCOUNT and can iterate over query results using a FOR loop.
Video of the presentation can be seen here: https://www.youtube.com/watch?v=uxuLRiNoDio
The Data Source API in Spark is a convenient feature that enables developers to write libraries to connect to data stored in various sources with Spark. Equipped with the Data Source API, users can load/save data from/to different data formats and systems with minimal setup and configuration. In this talk, we introduce the Data Source API and the unified load/save functions built on top of it. Then, we show examples to demonstrate how to build a data source library.
Presented the "A Cloud Journey - Move to the Oracle Cloud" on behalf of Ricardo Gonzalez during Bulgarian Oracle User Group Spring Conference 2019. This presentation discusses various methods on how to migrate to the Oracle Cloud and provides recommendations as to which tool to use (and where to find it) especially assuming that Zero Downtime Migration is desired, for which the new Zero Downtime Migration tool is described and discussed in detail. More information: http://www.oracle.com/goto/move
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB
“Why is MongoDB so slow?” you may ask yourself on occasion. You’ve created indexes, you’ve learned how to use the aggregation pipeline. What the heck? Could it be your queries? This talk will outline what tools are at your disposal (both in MongoDB Atlas and in MongoDB server) to identify inefficient queries.
This document discusses Presto, an interactive SQL query engine for big data. It describes how Presto is optimized to quickly query data stored in Parquet format at Uber. Key optimizations for Parquet include nested column pruning, columnar reads, predicate pushdown, dictionary pushdown, and lazy reads. Benchmark results show these optimizations improve Presto query performance. The document also provides an overview of Uber's analytics infrastructure, applications of Presto, and ongoing work to further optimize Presto and Hadoop.
The document outlines an Oracle Database Administration competency roadmap. It lists various certifications, technologies, and skills grouped under headings like Tuning and Backup/Recovery, Programming, Server, Security/Methodology. The roadmap provides a path for administrators to develop expertise in areas like database installation, configuration, performance tuning, programming, security, and emerging technologies.
Applying DevOps to Databricks can be a daunting task. In this talk this will be broken down into bite size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IAC (Infrastructure as Code) and Build Agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, will will focus on the Databricks Rest API (using Python) to perform our tasks.
Dynamic Partition Pruning in Apache SparkDatabricks
In data analytics frameworks such as Spark it is important to detect and avoid scanning data that is irrelevant to the executed query, an optimization which is known as partition pruning. Dynamic partition pruning occurs when the optimizer is unable to identify at parse time the partitions it has to eliminate. In particular, we consider a star schema which consists of one or multiple fact tables referencing any number of dimension tables. In such join operations, we can prune the partitions the join reads from a fact table by identifying those partitions that result from filtering the dimension tables. In this talk we present a mechanism for performing dynamic partition pruning at runtime by reusing the dimension table broadcast results in hash joins and we show significant improvements for most TPCDS queries.
This was a short introduction to Scala programming language.
me and my colleague lectured these slides in Programming Language Design and Implementation course in K.N. Toosi University of Technology.
Smart Join Algorithms for Fighting Skew at ScaleDatabricks
Consumer apps like Yelp generate log data at huge scale, and often this is distributed according to a power law, where a small number of users, businesses, locations, or pages are associated with a disproportionately large amount of data. This kind of data skew can cause problems for distributed algorithms, especially joins, where all the rows with the same key must be processed by the same executor. Even just a single over-represented entity can cause a whole job to slow down or fail. One approach to this problem is to remove outliers before joining, and this might be fine when training a machine learning model, but sometimes you need to retain all the data. Thankfully, there are a few tricks you can use to counteract the negative effects of skew while joining, by artificially redistributing data across more machines. This talk will walk through some of them, with code examples.
AST - the only true tool for building JavaScriptIngvar Stepanyan
The document discusses working with code abstract syntax trees (ASTs). It provides examples of parsing code into ASTs using libraries like Esprima, querying ASTs using libraries like grasp-equery, constructing and transforming ASTs, and generating code from ASTs. It introduces aster, an AST-based code builder that allows defining reusable AST transformations as plugins and integrating AST-based builds into generic build systems like Grunt and Gulp. Aster aims to improve on file-based builders by working directly with ASTs in a streaming fashion.
Building an api using golang and postgre sql v1.0Frost
This document provides instructions for building a REST API using Golang and PostgreSQL. It discusses setting up the PostgreSQL database, defining a data structure in Golang, and implementing CRUD operations through API endpoints. Key steps include connecting to the database, querying and executing SQL statements to get, add, update and delete record data, and encoding responses as JSON. The API routes and handlers are defined, CORS is enabled, and the server is started to execute the API.
This document discusses metaprogramming in C++ and describes an RPC framework implemented using C++ templates and metaprogramming. It defines an interface called calc_interface for a calculator RPC with functions like add and signals like expr_evaluated. It shows how to define the interface using meta_functions, and how a client can invoke functions on the server while handling serialization, transport, and response deserialization. Key aspects covered include defining the interface, invoking functions from the client, argument checking, and verifying functions exist in the interface.
The document discusses using the Tornado web framework to develop RESTful APIs. It provides examples of implementing RESTful APIs in Tornado, including handling HTTP verbs, returning JSON/JSONP responses, handling exceptions, scraping web pages to monitor server status, and notifying subscribers via push notifications when status changes. Other topics mentioned include internationalization, cron jobs, and related resources for Tornado and RESTful APIs.
This document discusses using Zend Framework for building web applications. It describes how Zend_Application provides dependency injection and configuration without requiring objects. It also covers using Zend_Db for database access, Zend_Controller for routing, and Zend_Translate for internationalization. Validation is discussed, including using Zend_Validate with Zend_Translate to internationalize error messages.
From mysql to MongoDB(MongoDB2011北京交流会)Night Sailer
The document summarizes differences between MySQL and MongoDB data types and operations. MongoDB uses BSON for data types rather than separate numeric, text and blob types. It supports embedded documents and arrays. Unlike MySQL, MongoDB does not have tables or rows, but collections and documents. Operations like insert, update, find, sort and index are discussed as alternatives to SQL equivalents.
The battle of Protractor and Cypress - RunIT Conference 2019Ludmila Nesvitiy
The document compares the testing frameworks Protractor and Cypress. It discusses their installation, onboarding, configuration, APIs, parallelization capabilities, debugging techniques, supported test types including API, unit and e2e tests. It also covers project size, performance when run on Chrome visible and headless modes, support for multiple platforms and various reporting options. Useful links are also provided for references.
Altitude NY 2018: Leveraging Log Streaming to Build the Best Dashboards, EverFastly
If knowing is half the battle, having the most information available is the best way to win. Using real-time log streaming and a knowledge of the data passing through the system, metrics can provide more depth and breadth in to the goings on requests as they pass through various parts of the stack. This session will cover the difference between logging and metrics, writing JSON and Influx Line Protocol in VCL, and building out dashboards to give deeper insights (and more importantly, alerting) on requests and responses at the edge.
ManageIQ currently runs on Ruby on Rails 3. Aaron "tenderlove" Patterson presents his effort to migrate to RoR 4, which entails some changes in the code to take advantage of the latest advances in RoR.
For more on ManageIQ, see http://manageiq.org/
The document discusses moving a Slack commands application from a single Express app hosting all commands to individual serverless functions (Lambda) per command. It provides examples of the code structure for a sample "user stats" command function, how it is executed by AWS Lambda, and how CloudFormation templates are used to define and deploy the Lambda function and its resources to AWS.
Apache Spark in your likeness - low and high level customizationBartosz Konieczny
User-defined features and session extensions in Apache Spark allow for high-level and low-level customization. High-level customization includes user-defined types (UDT), user-defined functions (UDF), and user-defined aggregate functions (UDAF). Low-level customization involves extensions to the analyzer rules, optimizations, physical execution, and more. The document provides examples of UDT, UDF, and UDAF and discusses how they allow incorporating custom logic into Spark applications similar to stored procedures in databases.
This document provides an overview of various JavaScript concepts and techniques, including:
- Prototypal inheritance allows objects in JavaScript to inherit properties from other objects. Functions can act as objects and have a prototype property for inheritance.
- Asynchronous code execution in JavaScript is event-driven. Callbacks are assigned as event handlers to execute code when an event occurs.
- Scope and closures - variables are scoped to functions. Functions create closures where they have access to variables declared in their parent functions.
- Optimization techniques like event delegation and requestAnimationFrame can improve performance of event handlers and animations.
Postman is a tool for designing, sharing and testing APIs between a group of collaborators that range from the API developers down to the final clients, be them mobile apps or web apps.
This presentation focuses on using Postman's advanced free features, with a special focus on testing.
I have linked an example collection which I refer to several times during the presentation.
Section 1 - Fundamentals
Environments, variables, collections, and workspaces
Roles, VCS
Section 2 - Scripts & Testing
Pre request scripts and tests
Scopes
Pass data between requests
Section 3 - Integrated testing
Collection runners: read data from files, workflows
Monitors
CD/CI integration with Newman
Section 4 - More!
Documentation
Mock server
Integrations
This document provides an overview of DataMapper, an object-relational mapper (ORM) library for Ruby applications. It summarizes DataMapper's main features such as associations, migrations, database adapters, naming conventions, validations, custom types and stores. The document also provides examples of how to use DataMapper with different databases, import/export data, and validate models for specific contexts.
OSX/Flashback
El sistema operativo Apple OS X, al igual que todos los sistemas operativos, puede convertirse en una víctima de software malicioso. Antes de la aparición de OSX/Flashback, hubo varios casos documentados de malware dirigido a OS X; pero hasta ahora, OSX/Flashback fue el que cobró la mayor cantidad de víctimas. En este artículo se describen las características técnicas más interesantes de la amenaza, en especial el método utilizado para espiar las comunicaciones de red y los algoritmos para la generación dinámica de nombres de dominio. También se incluye una línea de tiempo con los puntos más importantes del malware, cuyo ciclo de vida persistió durante tantos meses.
This document discusses cross-platform support and push notifications in Windows Azure Mobile Services. It explains how to send push notifications to different device platforms including Windows Store, Windows Phone, iOS, and Android. It also discusses using service filters and delegating handlers to intercept requests and responses for custom processing like adding versioning information.
Similar to Using Cerberus and PySpark to validate semi-structured datasets (20)
Impartiality as per ISO /IEC 17025:2017 StandardMuhammadJazib15
This document provides basic guidelines for imparitallity requirement of ISO 17025. It defines in detial how it is met and wiudhwdih jdhsjdhwudjwkdbjwkdddddddddddkkkkkkkkkkkkkkkkkkkkkkkwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwioiiiiiiiiiiiii uwwwwwwwwwwwwwwwwhe wiqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq gbbbbbbbbbbbbb owdjjjjjjjjjjjjjjjjjjjj widhi owqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq uwdhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhwqiiiiiiiiiiiiiiiiiiiiiiiiiiiiw0pooooojjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjjj whhhhhhhhhhh wheeeeeeee wihieiiiiii wihe
e qqqqqqqqqqeuwiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiqw dddddddddd cccccccccccccccv s w c r
cdf cb bicbsad ishd d qwkbdwiur e wetwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwwww w
dddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddddfffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffffw
uuuuhhhhhhhhhhhhhhhhhhhhhhhhe qiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii iqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc ccccccccccccccccccccccccccccccccccc bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbu uuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuuum
m
m mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm m i
g i dijsd sjdnsjd ndjajsdnnsa adjdnawddddddddddddd uw
Digital Twins Computer Networking Paper Presentation.pptxaryanpankaj78
A Digital Twin in computer networking is a virtual representation of a physical network, used to simulate, analyze, and optimize network performance and reliability. It leverages real-time data to enhance network management, predict issues, and improve decision-making processes.
Generative AI Use cases applications solutions and implementation.pdfmahaffeycheryld
Generative AI solutions encompass a range of capabilities from content creation to complex problem-solving across industries. Implementing generative AI involves identifying specific business needs, developing tailored AI models using techniques like GANs and VAEs, and integrating these models into existing workflows. Data quality and continuous model refinement are crucial for effective implementation. Businesses must also consider ethical implications and ensure transparency in AI decision-making. Generative AI's implementation aims to enhance efficiency, creativity, and innovation by leveraging autonomous generation and sophisticated learning algorithms to meet diverse business challenges.
https://www.leewayhertz.com/generative-ai-use-cases-and-applications/
Null Bangalore | Pentesters Approach to AWS IAMDivyanshu
#Abstract:
- Learn more about the real-world methods for auditing AWS IAM (Identity and Access Management) as a pentester. So let us proceed with a brief discussion of IAM as well as some typical misconfigurations and their potential exploits in order to reinforce the understanding of IAM security best practices.
- Gain actionable insights into AWS IAM policies and roles, using hands on approach.
#Prerequisites:
- Basic understanding of AWS services and architecture
- Familiarity with cloud security concepts
- Experience using the AWS Management Console or AWS CLI.
- For hands on lab create account on [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
# Scenario Covered:
- Basics of IAM in AWS
- Implementing IAM Policies with Least Privilege to Manage S3 Bucket
- Objective: Create an S3 bucket with least privilege IAM policy and validate access.
- Steps:
- Create S3 bucket.
- Attach least privilege policy to IAM user.
- Validate access.
- Exploiting IAM PassRole Misconfiguration
-Allows a user to pass a specific IAM role to an AWS service (ec2), typically used for service access delegation. Then exploit PassRole Misconfiguration granting unauthorized access to sensitive resources.
- Objective: Demonstrate how a PassRole misconfiguration can grant unauthorized access.
- Steps:
- Allow user to pass IAM role to EC2.
- Exploit misconfiguration for unauthorized access.
- Access sensitive resources.
- Exploiting IAM AssumeRole Misconfiguration with Overly Permissive Role
- An overly permissive IAM role configuration can lead to privilege escalation by creating a role with administrative privileges and allow a user to assume this role.
- Objective: Show how overly permissive IAM roles can lead to privilege escalation.
- Steps:
- Create role with administrative privileges.
- Allow user to assume the role.
- Perform administrative actions.
- Differentiation between PassRole vs AssumeRole
Try at [killercoda.com](https://killercoda.com/cloudsecurity-scenario/)
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELijaia
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Build the Next Generation of Apps with the Einstein 1 Platform.
Rejoignez Philippe Ozil pour une session de workshops qui vous guidera à travers les détails de la plateforme Einstein 1, l'importance des données pour la création d'applications d'intelligence artificielle et les différents outils et technologies que Salesforce propose pour vous apporter tous les bénéfices de l'IA.
Accident detection system project report.pdfKamal Acharya
The Rapid growth of technology and infrastructure has made our lives easier. The
advent of technology has also increased the traffic hazards and the road accidents take place
frequently which causes huge loss of life and property because of the poor emergency facilities.
Many lives could have been saved if emergency service could get accident information and
reach in time. Our project will provide an optimum solution to this draw back. A piezo electric
sensor can be used as a crash or rollover detector of the vehicle during and after a crash. With
signals from a piezo electric sensor, a severe accident can be recognized. According to this
project when a vehicle meets with an accident immediately piezo electric sensor will detect the
signal or if a car rolls over. Then with the help of GSM module and GPS module, the location
will be sent to the emergency contact. Then after conforming the location necessary action will
be taken. If the person meets with a small accident or if there is no serious threat to anyone’s
life, then the alert message can be terminated by the driver by a switch provided in order to
avoid wasting the valuable time of the medical rescue team.
Determination of Equivalent Circuit parameters and performance characteristic...pvpriya2
Includes the testing of induction motor to draw the circle diagram of induction motor with step wise procedure and calculation for the same. Also explains the working and application of Induction generator
10. PySpark integration - extended Cerberus
UNKNOWN_NETWORK = ErrorDefinition(333, 'network_exists')
class ExtendedValidator(Validator):
def _validate_networkexists(self, allowed_values, field, value):
if value not in allowed_values:
self._error(field, UNKNOWN_NETWORK, {})
class ErrorCodesHandler(SchemaErrorHandler):
def __call__(self, validation_errors):
def concat_path(document_path):
return '.'.join(document_path)
output_errors = {}
for error in validation_errors:
if error.is_group_error:
for child_error in error.child_errors:
output_errors[concat_path(child_error.document_path)] =
child_error.code
else:
output_errors[concat_path(error.document_path)] = error.code
return output_errors
extended
Validator,
custom
validation
rule
not the
same as
previously
custom
output for
#errors
call
11. PySpark integration - .mapPartitions function
def check_for_errors(rows):
validator = ExtendedValidator(schema,
error_handler=ErrorCodesHandler())
def default_dictionary():
return defaultdict(int)
errors = defaultdict(default_dictionary)
for row in rows:
validation_result =
validator.validate(row.asDict(recursive=True),
normalize=False)
if not validation_result:
for error_field, error_code in
validator.errors.items():
errors[error_field][error_code] += 1
return [(k, dict(v)) for k, v in errors.items()]
disabled
normalization
SchemaErrorHandler ⇒ child of BasicErrorHandler; BasicErrorHandler child of BaseErrorHandler
define what is the type (datetime, int, string, ..)
mapping ⇒ type, items, empty ⇒ validator types
https://docs.python-cerberus.org/en/stable/validation-rules.html
explain why normalize=False
calls __normalize_mapping
* you can specify a new attribute - normalization will rename it if needed >>> v = Validator({'foo': {'rename': 'bar'}})
>>> v.normalized({'foo': 0})
{'bar': 0}
* later it purges all unknown fields if Validator({'foo': {'type': 'string'}}, purge_unknown=True) is specified
* applies the defaults also
* can also call a coercion, a callable that can for instance convert the type of a field
Coercion allows you to apply a callable (given as object or the name of a custom coercion method) to a value before the document is validated
check if validator.clear_caches can help
explain why normalize=False
calls __normalize_mapping
* you can specify a new attribute - normalization will rename it if needed >>> v = Validator({'foo': {'rename': 'bar'}})
>>> v.normalized({'foo': 0})
{'bar': 0}
* later it purges all unknown fields if Validator({'foo': {'type': 'string'}}, purge_unknown=True) is specified
* applies the defaults also
* can also call a coercion, a callable that can for instance convert the type of a field
Coercion allows you to apply a callable (given as object or the name of a custom coercion method) to a value before the document is validated
check if validator.clear_caches can help