Jeff Geerling (geerlingguy) gives an overview of the Ansible 2.0.0 and Ansible Galaxy 2.0.0 releases in early 2016.
Jeff Geerling is the author of Ansible for DevOps (www.ansiblefordevops.com) and helps organize the St. Louis Ansible meetup group.
Harnessing Spark and Cassandra with GroovySteve Pember
This talk is an introduction to a powerful combination in the big data space: Apache Spark and Cassandra. Spark is a cluster-computing framework that allows users to perform calculations against resilient in-memory datasets using a functional programming interface. Cassandra is a linearly scalable, fault tolerant, decentralized datastore. These two technologies are complicated, but integrate well and provide such a level of utility that whole companies have formed around them.
In this talk we’ll learn how Spark and Cassandra can be leveraged within your Groovy Application: Spark normally asks for a Scala environment. We’ll talk about Spark and Cassandra from a high level and walk through code examples. We’ll discuss the pitfalls of working with these technologies - like modeling your data appropriately to ensure even distribution in Cassandra and general packaging woes with Spark - and ways to avoid them. Finally, we’ll explore how we at ThirdChannel are using these technologies.
The need to crunch large amounts of data to extract useful statistics is increasingly common. Using services like Amazon Redshift and Amazon Elastic MapReduce, we will show how you can process log data to produce helpful reports and give your analysts the tools to find useful data. We will dive deep into these systems, building a usable example from scratch using the AWS SDK for Ruby.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
Jeff Geerling (geerlingguy) gives an overview of the Ansible 2.0.0 and Ansible Galaxy 2.0.0 releases in early 2016.
Jeff Geerling is the author of Ansible for DevOps (www.ansiblefordevops.com) and helps organize the St. Louis Ansible meetup group.
Harnessing Spark and Cassandra with GroovySteve Pember
This talk is an introduction to a powerful combination in the big data space: Apache Spark and Cassandra. Spark is a cluster-computing framework that allows users to perform calculations against resilient in-memory datasets using a functional programming interface. Cassandra is a linearly scalable, fault tolerant, decentralized datastore. These two technologies are complicated, but integrate well and provide such a level of utility that whole companies have formed around them.
In this talk we’ll learn how Spark and Cassandra can be leveraged within your Groovy Application: Spark normally asks for a Scala environment. We’ll talk about Spark and Cassandra from a high level and walk through code examples. We’ll discuss the pitfalls of working with these technologies - like modeling your data appropriately to ensure even distribution in Cassandra and general packaging woes with Spark - and ways to avoid them. Finally, we’ll explore how we at ThirdChannel are using these technologies.
The need to crunch large amounts of data to extract useful statistics is increasingly common. Using services like Amazon Redshift and Amazon Elastic MapReduce, we will show how you can process log data to produce helpful reports and give your analysts the tools to find useful data. We will dive deep into these systems, building a usable example from scratch using the AWS SDK for Ruby.
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. Holden Karau and Joey Echeverria explore how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, and some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose. Holden and Joey demonstrate how to effectively search logs from Apache Spark to spot common problems and discuss options for logging from within your program itself. Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but Holden and Joey look at how to effectively use Spark’s current accumulators for debugging before gazing into the future to see the data property type accumulators that may be coming to Spark in future versions. And in addition to reading logs and instrumenting your program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems. Holden and Joey cover how to quickly use the UI to figure out if certain types of issues are occurring in your job.
The talk will wrap up with Holden trying to get everyone to buy several copies of her new book, High Performance Spark.
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
This was a talk that Kelvin Chu and I just gave at the SF Bay Area Spark Meetup 5/14 at Palantir Technologies.
We discussed the Spark Job Server (http://github.com/ooyala/spark-jobserver), its history, example workflows, architecture, and exciting future plans to provide HA spark job contexts.
We also discussed the use case of the job server at Ooyala to facilitate fast query jobs using shared RDD and a shared job context, and how we integrate with Apache Cassandra.
ave time learning on your own. Start Building with React, MongoDB, Express, & Node. The MERN Stack.
Learning a new JavaScript framework is difficult. You can spend weeks learning new concepts. If an online example doesn’t work, you may spend countless hours Googling, searching Stack Overflow and blogs for the solution.
Take the fast track and learn from an experienced Senior Software Engineer and professional instructor!
About this Course
This highly interactive course features a large amount of student labs and hands-on coding. You will be taught how to assemble the complete stack required to build a modern web app using React.js, MongoDB (a NoSQL database) and Express (a framework for web application servers). This course will also cover many other tools that go into building a complete web application: React Router, React-Bootstrap, Redux, Babel, and Webpack.
What You Will Learn
• How to use modern JavaScript features
• Webpack
• Node & Express
• Reading and writing data to a MongoDB database
• Babel
• React
• State Management with Redux
• Mongoose
• And More!
Over the past few years, web-applications have started to play an increasingly important role in our lives. We expect them to be always available and the data to be always fresh. This shift into the realm of real-time data processing is now transitioning to physical devices, and Gartner predicts that the Internet of Things will grow to an installed base of 26 billion units by 2020.
Reactive web-applications are an answer to the new requirements of high-availability and resource efficiency brought by this rapid evolution. On the JVM, a set of new languages and tools has emerged that enable the development of entirely asynchronous request and data handling pipelines. At the same time, container-less application frameworks are gaining increasing popularity over traditional deployment mechanisms.
This talk is going to give you an introduction into one of the most trending reactive web-application stack on the JVM, involving the Scala programming language, the concurrency toolkit Akka and the web-application framework Play. It will show you how functional programming techniques enable asynchronous programming, and how those technologies help to build robust and resilient web-applications.
Ansible is a Configuration Management System that is very simple to use, because of its straightforward and robust model for managing automation and it’s low barrier to entry for ease of use in both development and production.
During OpenStack development, Ansible can be used in conjunction with Vagrant and Devstack to manage complex, multi-node development environments with relative ease.
In this presentation, Juergen Brendel and David Lapsley review Ansible and provide some sample playbooks to get developers up and running quickly. They also describes how to use Ansible, Vagrant, Devstack, and OpenStack to accelerate OpenStack development cycles.
Speech of Nihad Abbasov, Senior Software Engineer at Digital Classifieds, at Ruby Meditation 27, Dnipro, 19.05.2019
Slideshare -
Next conference - http://www.rubymeditation.com/
How fast is your code? Performance is crucial as your startup grows, and optimizing your application can make a huge impact on user experience. During this talk, you will learn hints, techniques and best practices for improving the overall speed of your Ruby application.
Announcements and conference materials https://www.fb.me/RubyMeditation
News https://twitter.com/RubyMeditation
Photos https://www.instagram.com/RubyMeditation
The stream of Ruby conferences (not just ours) https://t.me/RubyMeditation
From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...confluent
Integrating Apache Kafka with other systems in a reliable and scalable way is often a key part of a streaming platform. Fortunately, Apache Kafka includes the Connect API that enables streaming integration both in and out of Kafka. Like any technology, understanding its architecture and deployment patterns is key to successful use, as is knowing where to go looking when things aren’t working. This talk will discuss the key design concepts within Kafka Connect and the pros and cons of standalone vs distributed deployment modes. We’ll do a live demo of building pipelines with Kafka Connect for streaming data in from databases, and out to targets including Elasticsearch. With some gremlins along the way, we’ll go hands-on in methodically diagnosing and resolving common issues encountered with Kafka Connect. The talk will finish off by discussing more advanced topics including Single Message Transforms, and deployment of Kafka Connect in containers.
Apache Camel Introduction & What's in the boxClaus Ibsen
Slides from JavaBin talk in Grimstad Norway, presented by Claus Ibsen in February 2016.
This slide deck is full up to date with latest Apache Camel 2.16.2 release and includes additional slides to present many of the features that Apache Camel provides out of the box.
"In this session, Twitter engineer Alex Payne will explore how the popular social messaging service builds scalable, distributed systems in the Scala programming language. Since 2008, Twitter has moved the development of its most critical systems to Scala, which blends object-oriented and functional programming with the power, robust tooling, and vast library support of the Java Virtual Machine. Find out how to use the Scala components that Twitter has open sourced, and learn the patterns they employ for developing core infrastructure components in this exciting and increasingly popular language."
Refactoring @ Mindvalley: Smells, Techniques and PatternsTristan Gomez
Every week my team commits really good, clean code. I decided to get the best of the commits and showcase what makes them good, what smells they address, and what techniques they used.
Chasing AMI - Building Amazon machine images with Puppet, Packer and JenkinsTomas Doran
Using puppet when configuring EC2 machines seems a natural fit. However bringing up new machines from a community image with puppet is not trivial and can be slow, and so not useful for auto-scaling.
The cloud also offers a solution to ongoing server maintenance, allowing you to launch fresh instances whenever you upgrade your applications (Immutable or Phoenix servers). However to predictably succeed, you need to freeze the puppet code alongside the application version for deployment.
The solution to these issues is generating custom machine images (AMIs) with your software inlined. This talk will cover Yelp's use of a Packer, Jenkins and Puppet for generating AMIs. This will include how we deal with issues like bootstrapping, getting canonical information about a machine's environment and cluster state at launch time, as well as supporting immutable/phoenix servers in combination with more traditional long lived servers inside our hybrid cloud infrastructure.
Slides from talk on legacy data migration. Includes introduction of Trucker gem and covers common migration issues.
This talk was given by Patrick Crowley and Rob Kaufman at RubyMidwest 2010 in Kansas City, MO.
spark-bench is an open-source benchmarking tool, and it’s also so much more. spark-bench is a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. spark-bench originally began as a benchmarking suite to get timing numbers on very specific algorithms mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases. This talk will discuss the high level design and capabilities of spark-bench before walking through some major, practical use cases. Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and i/o-intensive workloads; and, yes, even benchmarking. In particular this talk will address the use of spark-bench in developing new features features for Spark core.
Operational Tips For Deploying Apache SparkDatabricks
Spark is providing a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production. We will discuss how your choice of storage and overall pipeline design influence performance. We will review Spark’s configuration subsystem and discuss which configuration properties are relevant to you. We’ll also review common misconfigurations that prevent users from getting the most of their Spark deployment. Finally, I’ll discuss frequently encountered issues working with customer environments and present debugging techniques to get to the root cause. This talk should help answer the following questions: How should I deploy my Spark application (cluster size, storage format, etc)? How can I improve the performance of my Spark application? What’s causing my Spark application to crash?
Spark Job Server and Spark as a Query Engine (Spark Meetup 5/14)Evan Chan
This was a talk that Kelvin Chu and I just gave at the SF Bay Area Spark Meetup 5/14 at Palantir Technologies.
We discussed the Spark Job Server (http://github.com/ooyala/spark-jobserver), its history, example workflows, architecture, and exciting future plans to provide HA spark job contexts.
We also discussed the use case of the job server at Ooyala to facilitate fast query jobs using shared RDD and a shared job context, and how we integrate with Apache Cassandra.
ave time learning on your own. Start Building with React, MongoDB, Express, & Node. The MERN Stack.
Learning a new JavaScript framework is difficult. You can spend weeks learning new concepts. If an online example doesn’t work, you may spend countless hours Googling, searching Stack Overflow and blogs for the solution.
Take the fast track and learn from an experienced Senior Software Engineer and professional instructor!
About this Course
This highly interactive course features a large amount of student labs and hands-on coding. You will be taught how to assemble the complete stack required to build a modern web app using React.js, MongoDB (a NoSQL database) and Express (a framework for web application servers). This course will also cover many other tools that go into building a complete web application: React Router, React-Bootstrap, Redux, Babel, and Webpack.
What You Will Learn
• How to use modern JavaScript features
• Webpack
• Node & Express
• Reading and writing data to a MongoDB database
• Babel
• React
• State Management with Redux
• Mongoose
• And More!
Over the past few years, web-applications have started to play an increasingly important role in our lives. We expect them to be always available and the data to be always fresh. This shift into the realm of real-time data processing is now transitioning to physical devices, and Gartner predicts that the Internet of Things will grow to an installed base of 26 billion units by 2020.
Reactive web-applications are an answer to the new requirements of high-availability and resource efficiency brought by this rapid evolution. On the JVM, a set of new languages and tools has emerged that enable the development of entirely asynchronous request and data handling pipelines. At the same time, container-less application frameworks are gaining increasing popularity over traditional deployment mechanisms.
This talk is going to give you an introduction into one of the most trending reactive web-application stack on the JVM, involving the Scala programming language, the concurrency toolkit Akka and the web-application framework Play. It will show you how functional programming techniques enable asynchronous programming, and how those technologies help to build robust and resilient web-applications.
Ansible is a Configuration Management System that is very simple to use, because of its straightforward and robust model for managing automation and it’s low barrier to entry for ease of use in both development and production.
During OpenStack development, Ansible can be used in conjunction with Vagrant and Devstack to manage complex, multi-node development environments with relative ease.
In this presentation, Juergen Brendel and David Lapsley review Ansible and provide some sample playbooks to get developers up and running quickly. They also describes how to use Ansible, Vagrant, Devstack, and OpenStack to accelerate OpenStack development cycles.
Speech of Nihad Abbasov, Senior Software Engineer at Digital Classifieds, at Ruby Meditation 27, Dnipro, 19.05.2019
Slideshare -
Next conference - http://www.rubymeditation.com/
How fast is your code? Performance is crucial as your startup grows, and optimizing your application can make a huge impact on user experience. During this talk, you will learn hints, techniques and best practices for improving the overall speed of your Ruby application.
Announcements and conference materials https://www.fb.me/RubyMeditation
News https://twitter.com/RubyMeditation
Photos https://www.instagram.com/RubyMeditation
The stream of Ruby conferences (not just ours) https://t.me/RubyMeditation
From Zero to Hero with Kafka Connect (Robin Moffat, Confluent) Kafka Summit L...confluent
Integrating Apache Kafka with other systems in a reliable and scalable way is often a key part of a streaming platform. Fortunately, Apache Kafka includes the Connect API that enables streaming integration both in and out of Kafka. Like any technology, understanding its architecture and deployment patterns is key to successful use, as is knowing where to go looking when things aren’t working. This talk will discuss the key design concepts within Kafka Connect and the pros and cons of standalone vs distributed deployment modes. We’ll do a live demo of building pipelines with Kafka Connect for streaming data in from databases, and out to targets including Elasticsearch. With some gremlins along the way, we’ll go hands-on in methodically diagnosing and resolving common issues encountered with Kafka Connect. The talk will finish off by discussing more advanced topics including Single Message Transforms, and deployment of Kafka Connect in containers.
Apache Camel Introduction & What's in the boxClaus Ibsen
Slides from JavaBin talk in Grimstad Norway, presented by Claus Ibsen in February 2016.
This slide deck is full up to date with latest Apache Camel 2.16.2 release and includes additional slides to present many of the features that Apache Camel provides out of the box.
"In this session, Twitter engineer Alex Payne will explore how the popular social messaging service builds scalable, distributed systems in the Scala programming language. Since 2008, Twitter has moved the development of its most critical systems to Scala, which blends object-oriented and functional programming with the power, robust tooling, and vast library support of the Java Virtual Machine. Find out how to use the Scala components that Twitter has open sourced, and learn the patterns they employ for developing core infrastructure components in this exciting and increasingly popular language."
Refactoring @ Mindvalley: Smells, Techniques and PatternsTristan Gomez
Every week my team commits really good, clean code. I decided to get the best of the commits and showcase what makes them good, what smells they address, and what techniques they used.
Chasing AMI - Building Amazon machine images with Puppet, Packer and JenkinsTomas Doran
Using puppet when configuring EC2 machines seems a natural fit. However bringing up new machines from a community image with puppet is not trivial and can be slow, and so not useful for auto-scaling.
The cloud also offers a solution to ongoing server maintenance, allowing you to launch fresh instances whenever you upgrade your applications (Immutable or Phoenix servers). However to predictably succeed, you need to freeze the puppet code alongside the application version for deployment.
The solution to these issues is generating custom machine images (AMIs) with your software inlined. This talk will cover Yelp's use of a Packer, Jenkins and Puppet for generating AMIs. This will include how we deal with issues like bootstrapping, getting canonical information about a machine's environment and cluster state at launch time, as well as supporting immutable/phoenix servers in combination with more traditional long lived servers inside our hybrid cloud infrastructure.
Slides from talk on legacy data migration. Includes introduction of Trucker gem and covers common migration issues.
This talk was given by Patrick Crowley and Rob Kaufman at RubyMidwest 2010 in Kansas City, MO.
spark-bench is an open-source benchmarking tool, and it’s also so much more. spark-bench is a flexible system for simulating, comparing, testing, and benchmarking Spark applications and Spark itself. spark-bench originally began as a benchmarking suite to get timing numbers on very specific algorithms mostly in the machine learning domain. Since then it has morphed into a highly configurable and flexible framework suitable for many use cases. This talk will discuss the high level design and capabilities of spark-bench before walking through some major, practical use cases. Use cases include, but are certainly not limited to: regression testing changes to Spark; comparing performance of different hardware and Spark tuning options; simulating multiple notebook users hitting a cluster at the same time; comparing parameters of a machine learning algorithm on the same set of data; providing insight into bottlenecks through use of compute-intensive and i/o-intensive workloads; and, yes, even benchmarking. In particular this talk will address the use of spark-bench in developing new features features for Spark core.
Operational Tips For Deploying Apache SparkDatabricks
Spark is providing a way to make big data applications easier to work with, but understanding how to actually deploy the platform can be quite confusing. This talk will present operational tips and best practices based on supporting our (Databricks) customers with Spark in production. We will discuss how your choice of storage and overall pipeline design influence performance. We will review Spark’s configuration subsystem and discuss which configuration properties are relevant to you. We’ll also review common misconfigurations that prevent users from getting the most of their Spark deployment. Finally, I’ll discuss frequently encountered issues working with customer environments and present debugging techniques to get to the root cause. This talk should help answer the following questions: How should I deploy my Spark application (cluster size, storage format, etc)? How can I improve the performance of my Spark application? What’s causing my Spark application to crash?
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...sparktc
At the sold-out Spark & Machine Learning Meetup in Brussels on October 27, 2016, Nick Pentreath of the Spark Technology Center teamed up with Jean-François Puget of IBM Analytics to deliver a talk called Creating an end-to-endRecommender System with Apache Spark and Elasticsearch.
Jean-François and Nick started with a look at the workflow for recommender systems and machine learning, then moved on to data modeling and using Spark ML for collaborative filtering. They closed with a discussion of deploying and scoring the recommender models, including a demo.
Teaching Apache Spark Clusters to Manage Their Workers Elastically: Spark Sum...Spark Summit
Devops engineers have applied a great deal of creativity and energy to invent tools that automate infrastructure management, in the service of deploying capable and functional applications. For data-driven applications running on Apache Spark, the details of instantiating and managing the backing Spark cluster can be a distraction from focusing on the application logic. In the spirit of devops, automating Spark cluster management tasks allows engineers to focus their attention on application code that provides value to end-users.
Using Openshift Origin as a laboratory, we implemented a platform where Apache Spark applications create their own clusters and then dynamically manage their own scale via host-platform APIs. This makes it possible to launch a fully elastic Spark application with little more than the click of a button.
We will present a live demo of turn-key deployment for elastic Apache Spark applications, and share what we’ve learned about developing Spark applications that manage their own resources dynamically with platform APIs.
The audience for this talk will be anyone looking for ways to streamline their Apache Spark cluster management, reduce the workload for Spark application deployment, or create self-scaling elastic applications. Attendees can expect to learn about leveraging APIs in the Kubernetes ecosystem that enable application deployments to manipulate their own scale elastically.
Spark Application Carousel: Highlights of Several Applications Built with SparkDatabricks
This talk from 2015 Spark Summit East covers 3 applications built with Apache Spark:
1. Web Logs Analysis: Basic Data Pipeline - Spark & Spark SQL
2. Wikipedia Dataset Analysis: Machine Learning
3. Facebook API: Graph Algorithms
Faster Data Integration Pipeline Execution using Spark-JobserverDatabricks
As you may already know, the open-source Spark Job Server offers a powerful platform for managing Spark jobs, jars, and contexts, turning Spark into a much more convenient and easy-to-use service. The Spark-Jobserver can keep Spark context warmed up and readily available for accepting new jobs. At Informatica we are leveraging the Spark-Jobserver offerings to solve the data-visualization use-case.
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Databricks
Did you know almost every feature of the Spark Cassandra connector can be accessed without even a single Monad! In this talk I’ll demonstrate how you can take advantage of Spark on Cassandra using only the SQL you already know! Learn how to register tables, ETL data, and analyze query plans all from the comfort of your very own JDBC Client. Find out how you can access Cassandra with ease from the BI tool of your choice and take your analysis to the next level. Discover the tricks of debugging and analyzing predicate pushdowns using the Spark SQL Thrift Server. Preview the latest developments of the Spark Cassandra Connector.
Jumpstart on Apache Spark 2.2 on DatabricksDatabricks
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Jump Start on Apache® Spark™ 2.x with Databricks Databricks
Apache Spark 2.0 and subsequent releases of Spark 2.1 and 2.2 have laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop, you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Agenda:
• Overview of Spark Fundamentals & Architecture
• What’s new in Spark 2.x
• Unified APIs: SparkSessions, SQL, DataFrames, Datasets
• Introduction to DataFrames, Datasets and Spark SQL
• Introduction to Structured Streaming Concepts
• Four Hands On Labs
You will use Databricks Community Edition, which will give you unlimited free access to a ~6 GB Spark 2.x local mode cluster. And in the process, you will learn how to create a cluster, navigate in Databricks, explore a couple of datasets, perform transformations and ETL, save your data as tables and parquet files, read from these sources, and analyze datasets using DataFrames/Datasets API and Spark SQL.
Level: Beginner to intermediate, not for advanced Spark users.
Prerequisite: You will need a laptop with Chrome or Firefox browser installed with at least 8 GB. Introductory or basic knowledge Scala or Python is required, since the Notebooks will be in Scala; Python is optional.
Bio:
Jules S. Damji is an Apache Spark Community Evangelist with Databricks. He is a hands-on developer with over 15 years of experience and has worked at leading companies, such as Sun Microsystems, Netscape, LoudCloud/Opsware, VeriSign, Scalix, and ProQuest, building large-scale distributed systems. Before joining Databricks, he was a Developer Advocate at Hortonworks.
Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vad...Provectus
In this demo based talk with live coding, we’ll present a functional typeful framework for developing Apache Spark applications. We’ll walk through the following key topics: – turning unmanageable Spark scripts into typeful Spark Functions – serverless deployment of Spark functions into the cloud – unit testing Spark functions to save cluster resources and developers time – seamless Spark session management between concurrent Spark jobs in exclusive or share modes
A new look on Spark 2 features and Under the hood. We try to look at Apache spark latest release with an examining look, while still loving it, but also criticising it.
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
Apache Spark 2.x has laid the foundation for many new features and functionality. Its main three themes—easier, faster, and smarter—are pervasive in its unified and simplified high-level APIs for Structured data.
In this introductory part lecture and part hands-on workshop you’ll learn how to apply some of these new APIs using Databricks Community Edition. In particular, we will cover the following areas:
Apache Spark Fundamentals & Concepts
What’s new in Spark 2.x
SparkSessions vs SparkContexts
Datasets/Dataframes and Spark SQL
Introduction to Structured Streaming concepts and APIs
The Azure Cognitive Services on Spark: Clusters with Embedded Intelligent Ser...Databricks
We present the Azure Cognitive Services on Spark, a simple and easy to use extension of the SparkML Library to all Azure Cognitive Services. This integration allows Spark Users to embed cloud intelligence directly into their spark computations, enabling a new generation of intelligent applications on Spark. Furthermore, we show that with our new Containerized Cognitive Services, one can embed cloud intelligence directly into the Spark cluster for ultra-low latency, on-prem, and offline applications. We show how using our Integration, one can compose these cognitive services with other services, SQL computations, and Deep Networks to create sophisticated and intelligent heterogenous applications. Moreover, we show how to redeploy these compositions as Restful Services with Spark Serving. We will also explore the architecture of these contributions which leverage HTTP on Spark, a novel integration between Spark with the widely used Hypertext Transfer Protocol (HTTP). This library can integrate any framework into the Spark ecosystem that is capable of communicating through HTTP. Finally, we demonstrate how to use these services to create a large class of intelligent applications such as custom search engines, realtime facial recognition systems, and unsupervised object detectors.
Essentials of Automations: The Art of Triggers and Actions in FMESafe Software
In this second installment of our Essentials of Automations webinar series, we’ll explore the landscape of triggers and actions, guiding you through the nuances of authoring and adapting workspaces for seamless automations. Gain an understanding of the full spectrum of triggers and actions available in FME, empowering you to enhance your workspaces for efficient automation.
We’ll kick things off by showcasing the most commonly used event-based triggers, introducing you to various automation workflows like manual triggers, schedules, directory watchers, and more. Plus, see how these elements play out in real scenarios.
Whether you’re tweaking your current setup or building from the ground up, this session will arm you with the tools and insights needed to transform your FME usage into a powerhouse of productivity. Join us to discover effective strategies that simplify complex processes, enhancing your productivity and transforming your data management practices with FME. Let’s turn complexity into clarity and make your workspaces work wonders!
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
20 Comprehensive Checklist of Designing and Developing a WebsitePixlogix Infotech
Dive into the world of Website Designing and Developing with Pixlogix! Looking to create a stunning online presence? Look no further! Our comprehensive checklist covers everything you need to know to craft a website that stands out. From user-friendly design to seamless functionality, we've got you covered. Don't miss out on this invaluable resource! Check out our checklist now at Pixlogix and start your journey towards a captivating online presence today.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionAggregage
Join Maher Hanafi, VP of Engineering at Betterworks, in this new session where he'll share a practical framework to transform Gen AI prototypes into impactful products! He'll delve into the complexities of data collection and management, model selection and optimization, and ensuring security, scalability, and responsible use.
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Full-RAG: A modern architecture for hyper-personalizationZilliz
Mike Del Balso, CEO & Co-Founder at Tecton, presents "Full RAG," a novel approach to AI recommendation systems, aiming to push beyond the limitations of traditional models through a deep integration of contextual insights and real-time data, leveraging the Retrieval-Augmented Generation architecture. This talk will outline Full RAG's potential to significantly enhance personalization, address engineering challenges such as data management and model training, and introduce data enrichment with reranking as a key solution. Attendees will gain crucial insights into the importance of hyperpersonalization in AI, the capabilities of Full RAG for advanced personalization, and strategies for managing complex data integrations for deploying cutting-edge AI solutions.
Dr. Sean Tan, Head of Data Science, Changi Airport Group
Discover how Changi Airport Group (CAG) leverages graph technologies and generative AI to revolutionize their search capabilities. This session delves into the unique search needs of CAG’s diverse passengers and customers, showcasing how graph data structures enhance the accuracy and relevance of AI-generated search results, mitigating the risk of “hallucinations” and improving the overall customer journey.
Securing your Kubernetes cluster_ a step-by-step guide to success !KatiaHIMEUR1
Today, after several years of existence, an extremely active community and an ultra-dynamic ecosystem, Kubernetes has established itself as the de facto standard in container orchestration. Thanks to a wide range of managed services, it has never been so easy to set up a ready-to-use Kubernetes cluster.
However, this ease of use means that the subject of security in Kubernetes is often left for later, or even neglected. This exposes companies to significant risks.
In this talk, I'll show you step-by-step how to secure your Kubernetes cluster for greater peace of mind and reliability.
How to Get CNIC Information System with Paksim Ga.pptxdanishmna97
Pakdata Cf is a groundbreaking system designed to streamline and facilitate access to CNIC information. This innovative platform leverages advanced technology to provide users with efficient and secure access to their CNIC details.
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
Mind map of terminologies used in context of Generative AI
CODAIT/Spark-Bench
1. Simulate, Test, Compare, Exercise, and Yes, Even
Benchmark!
Spark-Bench
Emily May Curtin
The Weather Company
Atlanta, GA
@emilymaycurtin
@ecurtin
"S" + =
#SparkBench
2. Who Am I
• Artist, former documentary film editor
• 2017 Spark Summit East – Spark + Parquet
• Tabs, not spaces (at least for Scala)(where 1 tab = 2 spaces)
• Software Engineer, The Weather Company, an IBM Business
• Current lead developer of CODAIT/spark-bench
#SparkBench
10. Spark-Bench 2015
"SparkBench: A Comprehensive Benchmarking
Suite For In Memory Data Analytics Platform Spark"
Min Li, Jian Tan, Yandong Wang, Li Zhang, Valentina
Salapura IBM TJ Watson Research Center - APC 2015
#SparkBench
11. Spark-Bench 2015
"SparkBench: A Comprehensive Benchmarking
Suite For In Memory Data Analytics Platform Spark"
Min Li, Jian Tan, Yandong Wang, Li Zhang, Valentina
Salapura IBM TJ Watson Research Center - APC 2015
#SparkBench
17. #SparkBench
Hi-Bench, etc.
Spark vs. other big data engines
Profilers, Tracers
Fine-grain details of individual
Spark jobs and setups
💀databricks/spark-perf
Spark vs. Spark,
cluster performance
Too high level
Too low level
Juuuust right, sorta
18. As a researcher, I want to run a set
of standard benchmarks with
generated data against one
instance of Spark, one time.
--The user story that CODAIT/spark-bench version 1.0 fulfilled
#SparkBench
21. As a Spark developer, I want to
run regression tests with my
changes in order to provide some
authoritative performance stats
with my PR.
#SparkBench
22. As a Spark developer, I want to
simulate multiple notebook users
hitting the same cluster so that I
can test my changes to the
dynamic allocation algorithm.
Supporting Highly Multitenant Spark Notebook Workloads
Craig Ingram, Brad Kaiser
#SparkBench
23. As an enterprise user of Spark, I
want to make my daily batch
ingest job run faster by tuning
parameters.
#SparkBench
24. As a data scientist, I want to run
several machine learning
algorithms over the same set of
data in order to compare results.
#SparkBench
25. As anybody who has ever had to
make a graph about Spark, I want
to get numbers easily and quickly
so I can focus on making
beautiful graphs.
#SparkBench
27. Structural Details
Old Version ✨New Version✨
Project Design Series of shell scripts calling
individually built jars
Scala project built using SBT
Build/Release Manually/Manually SBT, TravisCI, auto-release to
Github Releases on PR merge
Configuration Scattered in shell script variables Centralized in one config file with
many new capabilities
Parallelism None Three different levels!
Custom
Workloads
Requires writing a new subproject with
all accompanying bash
Implement our abstract class, Bring
Your Own Jar
#SparkBench
28. The Config File (a first glance)
spark-bench = {
spark-submit-config = [{
workload-suites = [
{
descr = "One run of SparkPi and that's it!"
benchmark-output = "console"
workloads = [
{
name = "sparkpi"
slices = 10
}
]
}
]
}]
}
#SparkBench
29. Workloads
• Atomic unit of spark-bench
• Optionally read from disk
• Optionally write results to disk
• All stages of the workload are timed
• Infinitely composable
#SparkBench
30. Workloads
• Machine Learning
– Kmeans
– Logistic Regression
– Other workloads ported from old version
• SQL queries
• Oddballs
– SparkPi
– Sleep
– Cache Test
#SparkBench
31. Workloads: Data Generators
• Graph data generator
• Kmeans data generator
• Linear Regression data generator
• More coming!
#SparkBench
32. What About Other Workloads?
1. Active development, more coming!
2. Contributions welcome J
3. Bring Your Own Workload!
{
name = "custom"
class = "com.seriouscompany.HelloString"
str = ["Hello", "Hi"]
}
#SparkBench
33. Custom Workloads
#SparkBench
{
name = "custom"
class = "com.seriouscompany.OurGiganticIngestDailyIngestJob"
date = "Tues Oct 26 18:58:25 EDT 2017"
source = "s3://buncha-junk/2017-10-26/"
}
{
name = "custom”
class = "com.example.WordGenerator”
output = "console”
rows = 10
cols = 3
word = "Cool stuff!!"
}
ß See our documentation on
custom workloads!
37. Workload
Could be Kmeans,
SparkPi, SQL, Sleep, etc.
Data Generation Workload{
name = "sparkpi"
slices = 500000000
}
{
name = "data-generation-kmeans"
rows = 1000
cols = 24
output = "/tmp/kmeans-data.parquet"
k = 45
}
#SparkBench
38. Data Generation,
Small Set
Data Generation, Large Set
{
name = "data-generation-kmeans"
rows = 100000000
cols = 24
output = "/tmp/large-data.parquet"
}
{
name = "data-generation-kmeans"
rows = 100
cols = 24
output = "/tmp/small-data.parquet"
}
#SparkBench
41. Workload Suites
• Logical group of workloads
• Run workloads serially or in parallel
• Control the repetition of workloads
• Control the output of benchmark results
• Can be composed with other workload suites to
run serially or in parallel
#SparkBench
42. One workload suite
with three workloads
running serially
workload-suites = [ {
descr = "Lol"
parallel = false
repeat = 1
benchmark-output = "console"
workloads = [
{
// Three workloads here
}
]
}]
#SparkBench
43. One workload suite
with three workloads
running in parallel
workload-suites = [ {
descr = "Lol"
parallel = true
repeat = 1
benchmark-output = "console"
workloads = [
{
// Three workloads here
}
]
}]
#SparkBench
44. Three workload suites running serially.
Two have workloads running serially,
one has workloads running in parallel.
suites-parallel = false
workload-suites = [
{
descr = "One"
parallel = false
},
{
descr = "Two"
parallel = false
},
{
descr = "Three"
parallel = true
}]
(config file is abbreviated for space)
#SparkBench
45. Three workload suites running in parallel.
Two have workloads running serially,
one has workloads running in parallel.
suites-parallel = true
workload-suites = [
{
descr = "One"
parallel = false
},
{
descr = "Two"
parallel = false
},
{
descr = "Three"
parallel = true
}]
(config file is abbreviated for space)
#SparkBench
47. Spark-Submit-Config
• Control any parameter present in a spark-submit
script
• Produce and launch multiple spark-submit scripts
• Vary over parameters like executor-mem
• Run against different clusters or builds of Spark
• Can run serially or in parallel
#SparkBench
48. One spark-submit
with one workload
suite with three
workloads running
serially
spark-bench = {
spark-submit-config = [{
spark-args = {
master = "yarn"
}
workload-suites = [{
descr = "One"
benchmark-output = "console"
workloads = [
// . . .
]
}]
}]
}
(config file is abbreviated for space)
#SparkBench
49. Two spark-submits running in
parallel, each with one
workload suite, each with three
workloads running serially
#SparkBench
50. Two spark-submits running serially,
each with one workload suite,
one with three workloads running serially,
one with three workloads running in parallel
#SparkBench
51. One spark-submit with two workload
suites running serially.
The first workload suite generates
data in parallel.
The second workload suite runs
different workloads serially.
The second workload suite repeats.
#SparkBench
55. Example: Parquet vs. CSV
Goal: benchmark two different SQL queries over the
same dataset stored in two different formats
1. Generate a big dataset in CSV format
2. Convert that dataset to Parquet
3. Run first query over both sets
4. Run second query over both sets
5. Make beautiful graphs
#SparkBench
57. workload-suites = [
{
descr = "Generate a dataset, then transform it to Parquet format"
benchmark-output = "hdfs:///tmp/emily/results-data-gen.csv"
parallel = false
workloads = [
{
name = "data-generation-kmeans"
rows = 10000000
cols = 24
output = "hdfs:///tmp/emily/kmeans-data.csv"
},
{
name = "sql"
query = "select * from input"
input = "hdfs:///tmp/emily/kmeans-data.csv"
output = "hdfs:///tmp/emily/kmeans-data.parquet"
}
]
},
// …CONTINUED…
#SparkBench
58. {
descr = "Run two different SQL over two different formats"
benchmark-output = "hdfs:///tmp/emily/results-sql.csv"
parallel = false
repeat = 10
workloads = [
{
name = "sql"
input = [
"hdfs:///tmp/emily/kmeans-data.csv",
"hdfs:///tmp/emily/kmeans-data.parquet"
]
query = [
"select * from input",
"select `0`, `22` from input where `0` < -0.9"
]
cache = false
}
]
}
]
}]
} You made it through the plaintext! Here’s a puppy!#SparkBench
66. Multi-User Notebook
Data Generation,
sample.csv and
GIANT-SET.csv
Workloads,
working with the
sample.csv
Same workloads,
working with
GIANT-SET.csv
{
name = "sleep”
distribution = "poisson”
mean = 30000
}
Sleep
Workload
#SparkBench
67. Examples of Easy Extensions
• Add more "users" (workload suites)
• Have a workload suite that simulates heavy batch
jobs while notebook users attempt to use the
same cluster
• Toggle settings for dynamic allocation, etc.
• Benchmark a notebook user on a heavily
stressed cluster (cluster hammer!)
#SparkBench
73. Installation
1. Grab the latest release from the releases page on Github
2. Unpack the tar
3. Set the environment variable for $SPARK_HOME.
4. Run the examples!
./bin/spark-bench.sh examples/minimal-example.conf
#SparkBench
74. Future Work & Contributions
• More workloads and data generators
• TPC-DS (on top of existing SQL query benchmark)
• Streaming workloads
• Python
Contributions very welcome from all levels!
Check out our ZenHub Board for details
#SparkBench
75. Shout-Out to My Team
Matthew Schauer
@showermat
Craig Ingram
@cin
Brad Kaiser
@brad-kaiser
*not an actual photo of my team
#SparkBench
76. Summary
Spark-Bench is a flexible, configurable framework for
Testing, Benchmarking, Simulating, Comparing,
and more!
#SparkBench
https://codait.github.io/spark-bench/
78. CODAIT/spark-bench
Emily May Curtin
IBM Watson Data Platform
Atlanta, GA
@emilymaycurtin
@ecurtin
#EUeco8
#sparkbench
https://github.com/CODAIT/spark-bench
https://CODAIT.github.io/spark-bench/
Project
Documentation
Come See Me At DevNexus 2018!#SparkBench