The document discusses the challenges of continuous integration for Apache Spark applications and presents a solution developed by Uncharted Software. It describes squeezing Spark, tests, and other tools into Docker containers to enable building and testing Spark apps across branches in a shared environment. This approach enables automated testing of Spark code commits, early detection of issues, and visibility into test results.
Devoxx 2016: Using Jenkins, Gerrit and Spark for Continuous Delivery Analytics, by Luca Milanesio
Our journey and experience in dealing with the collection/analysis of Continuous Delivery log events using Gerrit Code Review, Jenkins with Apache Flume, ElasticSearch, Kibana and Spark
Feedback on the implementation of an architecture enabling Continuous Delivery in both PHP and Java environments at a leading company in the telecom industry. We describe the methods and tools we implemented, as well as the limitations and difficulties we encountered. Keywords: Industrialization, DevOps, Continuous Delivery, Ansible, Rundeck, Infrastructure as Code
A short overview of the main principles of Continuous Integration (CI), describing the benefits of CI and showing a smooth path for integrating CI into your development cycle, finishing with a short introduction to Xinc, a PHP CI server, and how to use it in your projects.
This is the presentation for the capstone workshop for the Houston R User Group. We walk through package creation, use GitHub to collaborate on code, and build a helper package in R for HRUG members.
A presentation on the fundamentals of the DevOps workflow and CI/CD practices, which I gave at Centroida (https://centroida.ai/) as a back-end development intern.
Setting up Notifications, Alerts & Webhooks with Flux v2, by Alison Dowdney, Weaveworks
Watch the recording here: https://youtu.be/cakxixc-yQk
❗️ Notifications & Alerts ⚠️
When operating a cluster, different teams may wish to receive notifications about the status of their GitOps pipelines. For example, the on-call team would receive alerts about reconciliation failures in the cluster, while the dev team may wish to be alerted when a new version of an app is deployed and whether the deployment is healthy.
Webhook Receivers
The GitOps toolkit controllers are pull-based by design. To notify the controllers about changes in Git or Helm repositories, you can set up webhooks and trigger a cluster reconciliation every time a source changes. Using webhook receivers, you can build push-based GitOps pipelines that react to external events.
Alison Dowdney, Developer Experience Engineer at Weaveworks and CNCF Ambassador, walks through how to define a provider, an alert, git commit status, exposing the webhook receiver and defining a git repository and receiver.
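As a rough sketch, triggering a reconciliation through a webhook receiver is just an HTTP POST to the receiver's exposed endpoint. The hostname and path below are placeholders (Flux generates the real receiver path, and most receiver types also validate a token or signature); see the webhook-receivers guide under Resources.

```python
# Hypothetical sketch: simulate a Git server's push webhook hitting a Flux
# notification-controller receiver endpoint. The hostname and /hook/... path
# are placeholders; Flux generates the real path when the Receiver object is
# created, and real receivers also verify a secret token or signature.
import json
import urllib.request

payload = json.dumps({"ref": "refs/heads/main"}).encode()
req = urllib.request.Request(
    "https://flux-webhooks.example.com/hook/REPLACE_WITH_RECEIVER_PATH",
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(resp.status)  # 200: the source reconciliation was triggered
```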
Resources
Flux2 Documentation: https://fluxcd.io/docs/
Flux Guide: Setup Notifications: https://fluxcd.io/docs/guides/notifications/
Flux Guide: Setup Webhook receivers: https://fluxcd.io/docs/guides/webhook-receivers/
Flux Roadmap: https://fluxcd.io/docs/roadmap/
Alison's Demo Repo: https://github.com/alisondy/flux-demos
DevOps is the next buzzword that every organization is expected to adopt. This presentation gives an overview of all the DevOps technologies you need to learn to transform your organization into a DevOps organization.
OSEDA 2017 Seminar at Kasetsart University on December 16, 2017
Hands-on GitOps Patterns for Helm Users. YouTube recording: https://youtu.be/ljouUBPtnuI
There are a lot of opinions on how to structure Flux 2 manifests the "GitOps Way." Flux maintainers have given specific examples of how to do this properly in the Flux user guides, demos, and example repos. But most of these focus on Kustomize, and not yet on patterns for users who want to use only Helm.
In this session, Scott Rigby, Developer Experience Engineer at Weaveworks, shares current work towards GitOps patterns for those who want to use only Helm with Flux 2. We welcome your feedback about use cases and challenges!
Slide counterparts of the video course can be found here: https://www.youtube.com/playlist?list=PLDshL1Z581YYxLsjYwM25HkIYrymXb7H_
Follow along or just sit back and enjoy a live, hands-on tutorial on the power routines of experienced Git users. We'll explore with real-world examples how to amend commits, do an interactive rebase (and why you would want to do one in the first place), how to solve conflicts without any merge tools, the power of less-known merge strategies, how to do interactive commits, and much more.
Course notes
Part 1/8: Introduction
Who am I?
Content overview.
Choice of test project.
Part 2/8: Housekeeping
Find my aliases here.
Enhance your shell with liquidprompt.
Start with an empty commit.
Shortcuts to list commits with nice annotations:
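(The course's specific aliases aren't reproduced here; as a purely hypothetical example, a shortcut of this kind typically looks like the following.)

```
# hypothetical example, not the course's actual alias
git log --oneline --graph --decorate --all
```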
Part 3/8: Amending and rebasing
Amend a commit.
Reset a commit to perform a rename.
Different resets affect different parts of a git repository.
Part 4/8: Interactive rebase
How to perform an interactive rebase to remove a binary file stuck in the repository.
Part 5/8: Solving conflicts
In this 5th part of the course we'll show a few concepts useful when solving merge and rebase conflicts with an interactive example on how to solve one.
We'll explain --ours, --theirs, conflict markers, and the process needed to solve a conflict using Git.
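For reference, this is the shape of what Git leaves in a conflicted file (branch names invented): everything between the markers is the two competing versions, and resolving means editing the file to its final content, or taking one side wholesale with git checkout --ours <file> or git checkout --theirs <file>, then staging the result.

```
<<<<<<< HEAD
the version from our branch
=======
the version from their branch
>>>>>>> feature-branch
```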
Part 6/8: What is a merge and alternative merge strategies
We cover the basics of what a merge is in Git and we show the use of a couple of less known merge strategies like the "ours" merge strategy and the "octopus" strategy.
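(As hypothetical one-liners, not material from the course, the two strategies look like this.)

```
# hypothetical examples, not from the course
git merge -s ours feature-branch    # keep our tree, but record the merge
git merge featureA featureB         # merging 2+ branches yields an octopus merge
```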
Part 7/8: Git interactive add
In this part we show how to perform an interactive add using Git, that is, splitting the contents of a change across multiple, more semantically meaningful commits.
Part 8/8: How to use Git stash
How to use git stash to solve common workflow situations, switch context with ease, and look cool to your colleagues.
Git is not just a version control system. Git can change the way you interact with your team members. Lots of teams don't think about reflecting their development workflow in Git and just use it out of the box. Git, however, can be much more powerful, giving your team a boost in productivity, protecting your delivery pipeline, and helping you work better together.
In this session we will start with a central workflow that is used by a lot of Subversion teams. You will learn how to practically integrate ALM solutions like continuous deployment, code reviews, and change tracking into your individual workflow. You will find out how to protect your master branch from accidental commits, broken builds, and unreviewed code. This presentation will help you discover the best way to work together as a team, whether you have yet to migrate to Git or are already an experienced Git user.
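Branch protection of this kind is normally enforced server-side by the hosting or review system, but as a hedged sketch, a local client-side guard against accidental direct pushes can be as small as the following pre-push hook (save as .git/hooks/pre-push and make it executable; the protected ref name is an assumption):

```python
#!/usr/bin/env python3
# Hypothetical client-side pre-push hook: Git feeds it lines of the form
# "<local ref> <local sha> <remote ref> <remote sha>" on stdin; exiting
# non-zero aborts the push. Here we refuse direct pushes to master.
import sys

PROTECTED = "refs/heads/master"  # assumed protected branch

for line in sys.stdin:
    local_ref, local_sha, remote_ref, remote_sha = line.split()
    if remote_ref == PROTECTED:
        sys.stderr.write("Direct pushes to master are blocked; push a review branch instead.\n")
        sys.exit(1)
sys.exit(0)
```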
We dreamt of a future where our whole department embraced TDD. A future where the quality of our code and the product was elevated.
We had the best intentions, however, this story does NOT have a happy ending.
Learn from my experience working with the leaders of a department of 40 to attempt 100% TDD adoption. I will contrast this with our successful adoption and spread of SOLID design practices, and look at what we would have done differently with our TDD advocacy if we were to repeat it. Some of our lessons learned: Don't try to mandate TDD, bring in outside experts, and refactoring skills are key.
Apigee Deploy Grunt Plugin: an API management lifecycle tool that makes your life easier by providing a pluggable JavaScript framework for API development.
Gradle is an open-source build automation tool focused on flexibility, build reproducibility and performance. Over the years, this tool has evolved and introduced new concepts and features around dependency management, publication and other aspects on build and release of artifacts for the Java platform.
Keeping up to date with all these features across several projects can be challenging. How do you make sure that all your projects can be upgraded to the latest version of Gradle? What if you have thousands of projects and hundreds of engineers? How can you abstract common tasks for them and make sure that new releases work as expected?
At Netflix, we built Nebula, a collection of Gradle plugins that helps engineers remove boilerplate in Gradle build files, and makes building software the Netflix way easy. This reduces the cognitive load on developers, allowing them to focus on writing code.
In this talk, I’ll share with you our philosophy on how to build JVM artifacts and the pieces that help us boost the productivity of engineers at Netflix. I’ll talk about:
- What is Nebula
- What are the common problems we face and try to solve
- How we distribute it to every JVM engineer
- How we ensure that Nebula/Gradle changes do not break builds so we can ship new features with confidence at Netflix
Our DevOps Journey
Transforming 6 Month Waterfalls to 1 Hour Code Deploys
https://info.dynatrace.com/17q3_wc_from_agile_to_cloudy_devops_na_registration.html
In the 2nd part of our webinar series, Anita Engleder, DevOps Lead at Dynatrace reviews and dissects lessons learned during the transformational journey moving Dynatrace from an on-prem culture to one that is cloud native. She will lend her perspective as a key member of the team that executed on the original vision: to “implement a new cloud native offering and deploy a new feature release every 2 weeks. Additionally, be able to support a 1-hour lead time from Code Change to Production”.
On November 17th at 1pm/10am PT Anita will present the challenges she and her team faced transforming 6-month waterfalls into 1-hour code deploys.
In this webinar Anita will discuss:
How to enable a complete cultural shift across multiple teams, in terms of thought process AND execution
What the specific role of her DevOps team is and how it played into the transformation
The role of Feature teams and why continuous feedback is critical for them
How to successfully influence key stakeholders for complete alignment
Today Anita's team runs 170 production changes every day across several AWS data centers as well as on-premise, something that would have been thought impossible only a few years prior.
Continuous Load Testing with CloudTest and Jenkins, by SOASTA
Two key challenges to continuous load testing are provisioning a test system to handle the load and accessing load generators to drive the traffic.
In this webinar from SOASTA & CloudBees, you will learn how to:
Build realistic automated web performance tests and run them in Jenkins
Architect and launch a test environment that auto-provisions in the cloud
Manage a load generation grid to drive load tests in a lights-out mode
Establish a performance baseline in your daily Jenkins reports
From 0 to DevOps in 80 Days [Webinar Replay], by Dynatrace
Link to the webinar replay: https://info.dynatrace.com/apm_dtm_ops_17q3_wc_from_enterprise_tocloud_native_na_registration.html
“Innovate or die” may sound extreme, but it's the only way to thrive in today's ever-competitive market. Bernd Greifeneder, CTO of Dynatrace, wanted to ensure that the company would still be relevant 5 years from now, so he formed an internal incubator with one goal: transform Dynatrace into a cloud native DevOps organization.
The incubator focused on what the company needed to do in order to integrate nascent cloud technologies so that they wouldn’t be left in the dust when the inevitable tipping point to cloud arrives. Transforming into a cloud native company would allow for rapid release cycles and provide an embedded feedback loop.
The results: Dynatrace now has 99.998% availability for its SaaS service and can deploy changes within an hour if necessary. In parallel, a new SaaS and managed offering is released every 2 weeks, with 170 production updates per day.
Watch this recorded webinar as Bernd Greifeneder shares the lessons learned moving Dynatrace from an on-prem company to one that is cloud native.
Bernd discusses:
• The driving factors that led to the transformation
• The goals set for the engineering team back in 2011
• How to sell such a transformation project in a large enterprise organization
• How to support this multi-year project from top down without impacting regular operations
• What's next on the innovator's mind
Emulators as an Emerging Best Practice for API Providers, by Cisco DevNet
Modern software leverages a lot of external APIs, which raises several issues: how can development teams safely build and test their software when some components they rely on keep evolving? And what if some of these components are key for the whole software to even fire up? These are some of the challenges faced today by Cisco internal development teams, but also by partner and enterprise developers using Cisco's software. The industry proposes two main strategies to circumvent these challenges: API mocking and service virtualization. Both of these strategies have pros and cons. I came up with the idea of proposing API emulators as a 3rd option. From another perspective, serverless platforms are recognized to be unique as they are scalable and straightforward to use. It turns out that major industry vendors (Google, Amazon, Microsoft…) have invested in creating emulators as companions to their serverless platforms, in order to attract and on-board developers and drive adoption of those platforms. In this talk, I'll explain the motivation behind API emulators from the perspective of DevOps, CI/CD, development processes, and serverless/microservices architectures. I'll illustrate the talk with several prototypes: Mini-Spark, an emulator for the Cisco Spark API which mimics the execution environment of the Cisco Spark Cloud platform, and Tropo-Ready, an emulator for the Tropo serverless telephony platform, tied to Visual Studio Code in order to reach developer communities. I'll dive into the details of the process of creating such emulators, and elaborate on the consideration that emulators are emerging as key components of API providers' strategy.
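To make the concept concrete, here is a minimal, hypothetical sketch of an API emulator (not Mini-Spark or Tropo-Ready themselves): a local process that mimics a remote API's contract so software can build and test against it offline. The endpoint and payload are invented.

```python
# Minimal sketch of an API emulator: serve canned responses that match the
# shape of a remote API so tests can run without the real service.
# The /v1/rooms endpoint and its payload are invented for illustration.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class EmulatedAPI(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/v1/rooms":
            body = json.dumps({"items": [{"id": "r1", "title": "demo room"}]})
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body.encode())
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # Point the application under test at http://localhost:8080 instead of
    # the real API host.
    HTTPServer(("localhost", 8080), EmulatedAPI).serve_forever()
```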
Continuous delivery requires more than DevOps. It also requires one to think differently about product design, development and testing, and the overall structure of the organization. This presentation will help you understand what it takes, and why one would want to deliver value to customers multiple times each day. #CIC
Jeff "Cheezy" Morgan and Ardita Karaj
Join Perfecto & CloudBees for a presentation on how to drive mobile app quality feedback in every build, on real devices. Watch a demo featuring the CloudBees Jenkins Workflow showcasing automated testing with Perfecto's Continuous Quality Lab.
Code Coverage for Total Security in Application Migrations, by Dana Luther
So the time has come to take the leap and upgrade your application to a new major version of the underlying framework, or perhaps to an entirely different framework. How do you ensure that none of your functionality or usability is impacted by a potentially drastic rewrite of the underlying systems? How can you move forward with 100% confidence in your migrated codebase? Testing, testing and more testing. Using a combination of unit, functional and acceptance tests can give you the certainty you need. In this talk, we will go over key strategies for ensuring that you begin with full code coverage and move forward with confidence.
Presentation of the SAPUI5/OpenUI5 Continuous Integration infrastructure for the DSAG (German-speaking SAP user group) workgroup for UI technologies, January 25th, 2017.
FPGA-Based Acceleration Architecture for Spark SQL, by Qi Xie and Quanfu Wang (Spark Summit)
In this session we will present a configurable FPGA-based Spark SQL acceleration architecture. It aims to leverage the FPGA's highly parallel computing capability to accelerate Spark SQL queries and, thanks to the FPGA's higher power efficiency compared to a CPU, to lower power consumption at the same time. The architecture consists of SQL query decomposition algorithms and fine-grained FPGA-based engine units which perform basic computations of substring, arithmetic, and logic operations. Using the SQL query decomposition algorithm, we are able to decompose a complex SQL query into basic operations, and according to their patterns each is fed into an engine unit. SQL engine units are highly configurable and can be chained together to perform complex Spark SQL queries, so that finally one SQL query is transformed into a hardware pipeline. We will present performance benchmark results comparing queries on the FPGA-based Spark SQL acceleration architecture (XEON E5 plus FPGA) to Spark SQL queries on the XEON E5 alone, showing 10X ~ 100X improvement, and we will demonstrate one SQL query workload from a real customer.
VEGAS: The Missing Matplotlib for Scala/Apache Spark, with DB Tsai and Roger M... (Spark Summit)
In this talk, we'll present techniques for visualizing large-scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix's famous recommender systems, which are used to personalize the Netflix experience for their 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the “missing Matplotlib” for Spark/Scala. We'll talk about the design of Vegas and its usage in Scala notebooks to visualize machine learning models.
This presentation introduces how we design and implement a real-time processing platform using the latest Spark Structured Streaming framework to intelligently transform production lines in the manufacturing industry. In a traditional production line there is a variety of isolated structured, semi-structured, and unstructured data, such as sensor data, machine screen output, log output, database records, etc. There are two main data scenarios: 1) picture and video data, low-frequency but large in size; 2) continuous high-frequency data, where each record is small but the total volume is very large, such as vibration data used to detect equipment quality. These data have the characteristics of streaming data: real-time, volatile, bursty, disordered, and unbounded. Making effective real-time decisions to retrieve value from these data is critical to smart manufacturing. The latest Spark Structured Streaming framework greatly lowers the bar for building highly scalable and fault-tolerant streaming applications. Thanks to Spark we were able to build a low-latency, high-throughput, and reliable operational system covering data acquisition, transmission, analysis, and storage. The actual user case proved that the system meets the needs of real-time decision-making. The system greatly enhances predictive fault repair in the production process and the efficiency of production-line material tracking, and can reduce the labor force needed for the production lines by about half.
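As a minimal sketch of a pipeline in this spirit (topic name, schema, and alert threshold are all invented, not the presenters' system), Structured Streaming can ingest high-frequency sensor readings from Kafka, aggregate them per machine in short event-time windows, and flag anomalies in near real time:

```python
# Minimal sketch (invented topic, schema, and threshold): ingest sensor
# readings from Kafka, average vibration per machine over 10-second
# event-time windows, and emit rows that cross an alert threshold.
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, from_json, window
from pyspark.sql.types import (DoubleType, StringType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("production-line-monitor").getOrCreate()

schema = StructType([
    StructField("machine_id", StringType()),
    StructField("vibration", DoubleType()),
    StructField("ts", TimestampType()),
])

readings = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")   # assumed address
    .option("subscribe", "sensor-readings")             # assumed topic
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

alerts = (readings
    .withWatermark("ts", "30 seconds")
    .groupBy(window(col("ts"), "10 seconds"), col("machine_id"))
    .agg(avg("vibration").alias("avg_vibration"))
    .where(col("avg_vibration") > 0.8))                 # assumed threshold

query = alerts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```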
Improving Traffic Prediction Using Weather Data, with Ramya Raghavendra (Spark Summit)
As common sense would suggest, weather has a definite impact on traffic. But how much? And under what circumstances? Can we improve traffic (congestion) prediction given weather data? Predictive traffic is envisioned to significantly impact how drivers plan their day by alerting users before they travel, finding the best times to travel, and, over time, learning from new IoT data such as road conditions, incidents, etc. This talk will cover the traffic prediction work conducted jointly by IBM and the traffic data provider. As part of this work, we conducted a case study over five large metropolitan areas in the US, 2.58 billion traffic records, and 262 million weather records, to quantify the boost in accuracy of traffic prediction using weather data. We will provide an overview of our lambda architecture, with Apache Spark being used to build prediction models from weather and traffic data, and Spark Streaming used to score the model and provide real-time traffic predictions. This talk will also cover a suite of extensions to Spark to analyze geospatial and temporal patterns in traffic and weather data, as well as the suite of machine learning algorithms that were used with the Spark framework. Initial results of this work were presented at the National Association of Broadcasters meeting in Las Vegas in April 2017, and there is work to scale the system to provide predictions in over 100 cities. The audience will learn about our experience scaling with Spark in offline and streaming modes, building statistical and deep-learning pipelines with Spark, and techniques for working with geospatial and time-series data.
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP, with Artem... (Spark Summit)
Graph is on the rise and it's time to start learning about scalable graph analytics! In this session we will go over two Spark-based graph analytics frameworks: Tinkerpop and GraphFrames. While both frameworks can express very similar traversals, they have different performance characteristics and APIs. In this deep-dive-by-example presentation, we will demonstrate some common traversals and explain how, at the Spark level, each traversal is actually computed under the hood! Learn both the fluent Gremlin API as well as the powerful GraphFrame motif API as we show examples of both simultaneously. No need to be familiar with graphs or Spark for this presentation, as we'll be explaining everything from the ground up!
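As a small, hedged illustration of the GraphFrames side (toy data; assumes the graphframes package is available to PySpark), the motif API expresses a traversal such as "find mutual follows" declaratively:

```python
# Toy GraphFrames motif example: find pairs that follow each other.
# Assumes pyspark was started with the graphframes package on the classpath.
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graph-demo").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b"), ("b", "a"), ("b", "c")], ["src", "dst"])

g = GraphFrame(vertices, edges)

# Motif: x follows y AND y follows x (edges left anonymous).
mutual = g.find("(x)-[]->(y); (y)-[]->(x)")
mutual.select("x.name", "y.name").show()
```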
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark, with Marcin... (Spark Summit)
Building accurate machine learning models has been an art of data scientists: algorithm selection, hyperparameter tuning, feature selection, and so on. Recently, challenges to break through these “black arts” have gotten started. In cooperation with our partner, NEC Laboratories America, we have developed a Spark-based automatic predictive modeling system. The system automatically searches for the best algorithm, parameters, and features without any manual work. In this talk, we will share how the automation system is designed to exploit the attractive advantages of Spark. The evaluation with real open data demonstrates that our system can explore hundreds of predictive models and discover the most accurate ones in minutes on an Ultra High Density Server, which employs 272 CPU cores, 2TB memory, and 17TB SSD in a 3U chassis. We will also share open challenges in learning such a massive number of models on Spark, particularly from reliability and stability standpoints. This talk will cover the presentation already shown at Spark Summit SF'17 (#SFds5) but from a more technical perspective.
Apache Spark and Tensorflow as a Service, with Jim Dowling (Spark Summit)
In Sweden, from the Rise ICE Data Center at www.hops.site, we are providing to researchers both Spark-as-a-Service and, more recently, Tensorflow-as-a-Service as part of the Hops platform. In this talk, we examine the different ways in which Tensorflow can be included in Spark workflows, from batch to streaming to structured streaming applications. We will analyse the different frameworks for integrating Spark with Tensorflow, from Tensorframes to TensorflowOnSpark to Databricks' Deep Learning Pipelines. We introduce the different programming models supported and highlight the importance of cluster support for managing different versions of Python libraries on behalf of users. We will also present cluster management support for sharing GPUs, including Mesos and YARN (in Hops Hadoop). Finally, we will perform a live demonstration of training and inference for a TensorflowOnSpark application written in Jupyter that can read data from either HDFS or Kafka, transform the data in Spark, and train a deep neural network on Tensorflow. We will show how to debug the application using both the Spark UI and Tensorboard, and how to examine logs and monitor training.
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library... (Spark Summit)
With the rapid growth of available datasets, it is imperative to have good tools for extracting insight from big data. The Spark ML library has excellent support for performing at-scale data processing and machine learning experiments, but more often than not, data scientists find themselves struggling with issues such as low-level data manipulation, lack of support for image processing, text analytics, and deep learning, and the inability to use Spark alongside other popular machine learning libraries. To address these pain points, Microsoft recently released the Microsoft Machine Learning Library for Apache Spark (MMLSpark), an open-source machine learning library built on top of SparkML that seeks to simplify the data science process and integrate SparkML Pipelines with deep learning and computer vision libraries such as the Microsoft Cognitive Toolkit (CNTK) and OpenCV. With MMLSpark, data scientists can build models with 1/10th of the code through Pipeline objects that compose seamlessly with other parts of the SparkML ecosystem. In this session, we explore some of the main lessons learned from building MMLSpark. Join us if you would like to know how to extend Pipelines to ensure seamless integration with SparkML, how to auto-generate Python and R wrappers from Scala Transformers and Estimators, how to integrate and use previously non-distributed libraries in a distributed manner, and how to efficiently deploy a Spark library across multiple platforms.
Next CERN Accelerator Logging Service, with Jakub Wozniak (Spark Summit)
The Next Accelerator Logging Service (NXCALS) is a new Big Data project at CERN aiming to replace the existing Oracle-based service.
The main purpose of the system is to store and present Controls/Infrastructure related data gathered from thousands of devices in the whole accelerator complex.
The data is used to operate the machines, improve their performance and conduct studies for new beam types or future experiments.
During this talk, Jakub will speak about NXCALS requirements and the design choices that led to the selected architecture based on Hadoop and Spark. He will present the Ingestion API, the abstractions behind the Meta-data Service, and the Spark-based Extraction API, where simple changes to the schema handling greatly improved the overall usability of the system. The system itself is not CERN-specific and can be of interest to other companies or institutes confronted with similar Big Data problems.
Powering a Startup with Apache Spark, with Kevin Kim (Spark Summit)
At Between (a mobile app for couples, downloaded 20M times globally), Spark powers everything from daily batch jobs for extracting metrics to analysis and dashboards. Spark is widely used by engineers and data analysts at Between; thanks to the performance and extensibility of Spark, data operations have become extremely efficient. The entire team, including biz dev, global operations, and designers, enjoys the data results, so Spark is empowering the whole company toward data-driven operation and thinking. Kevin, co-founder and data team leader of Between, will present how things are going at Between. Listeners will learn how a small and agile team lives with data (how they built the organization, culture, and technical base) after this presentation.
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications... (Spark Summit)
In many cases, Big Data becomes just another buzzword because of the lack of tools that can support both the technological requirements for developing and deploying the projects and the fluency of communication between the different profiles of people involved in the projects.
In this talk, we will present Moriarty, a set of tools for fast prototyping of Big Data applications that can be deployed in an Apache Spark environment. These tools support the creation of Big Data workflows using the already existing functional blocks or supporting the creation of new functional blocks. The created workflow can then be deployed in a Spark infrastructure and used through a REST API.
For a better understanding of Moriarty, the prototyping process, and the way it hides the Spark environment from Big Data users and developers, we will present it together with a couple of examples, one based on an Industry 4.0 success case and another on a logistics success case.
How Nielsen Utilized Databricks for Large-Scale Research and Development... (Spark Summit)
Large-scale testing of new data products or enhancements to existing products in a research and development environment can be a technical challenge for data scientists. In some cases, tools available to data scientists lack production-level capacity, whereas other tools do not provide the algorithms needed to run the methodology. At Nielsen, the Databricks platform provided a solution to both of these challenges. This breakout session will cover a specific Nielsen business case where two methodology enhancements were developed and tested at large-scale using the Databricks platform. Development and large-scale testing of these enhancements would not have been possible using standard database tools.
Spline: Apache Spark Lineage, not Only for the Banking Industry, with Marek Nov... (Spark Summit)
Data lineage tracking is one of the significant problems that financial institutions face when using modern big data tools. This presentation describes Spline – a data lineage tracking and visualization tool for Apache Spark. Spline captures and stores lineage information from internal Spark execution plans and visualizes it in a user-friendly manner.
Goal-Based Data Production, with Sim Simeonov (Spark Summit)
Since the invention of SQL and relational databases, data production has been about specifying how data is transformed through queries. While Apache Spark can certainly be used as a general distributed query engine, the power and granularity of Spark’s APIs enables a revolutionary increase in data engineering productivity: goal-based data production. Goal-based data production concerns itself with specifying WHAT the desired result is, leaving the details of HOW the result is achieved to a smart data warehouse running on top of Spark. That not only substantially increases productivity, but also significantly expands the audience that can work directly with Spark: from developers and data scientists to technical business users. With specific data and architecture patterns spanning the range from ETL to machine learning data prep and with live demos, this session will demonstrate how Spark users can gain the benefits of goal-based data production.
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le... (Spark Summit)
Have you imagined a simple machine learning solution able to prevent revenue leakage and monitor your distributed application? To answer this question, we offer a practical and simple machine learning solution to create an intelligent monitoring application based on simple data analysis using Apache Spark MLlib. Our application uses linear regression models to make predictions and check whether the platform is experiencing any operational problems that can result in revenue losses. The application monitors distributed systems and provides notifications stating the problem detected; that way users can act quickly to avoid serious problems which directly impact the company's revenue, reducing the time to action. We will present an architecture for not only a monitoring system, but also an active actor in our outage recoveries. At the end of the presentation you will have access to our training program source code and you will be able to adapt and implement it in your company. This solution already helped prevent about US$3M in losses last year.
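A minimal sketch of this regression-based monitoring idea (toy data, invented column names and tolerance, not the presenters' actual code):

```python
# Sketch: fit a linear model on historical platform metrics, then flag
# observations that deviate from the prediction by more than an (assumed)
# 20% tolerance. Data and column names are invented for illustration.
from pyspark.sql import SparkSession
from pyspark.sql.functions import abs as sql_abs, col
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("revenue-monitor").getOrCreate()

history = spark.createDataFrame(
    [(1000.0, 98.0), (2000.0, 205.0), (3000.0, 310.0), (4000.0, 395.0)],
    ["requests", "transactions"])
assembler = VectorAssembler(inputCols=["requests"], outputCol="features")

model = LinearRegression(labelCol="transactions").fit(assembler.transform(history))

# Latest observation: request volume is normal but transactions collapsed,
# which is exactly the revenue-leakage signature described above.
latest = assembler.transform(
    spark.createDataFrame([(3500.0, 120.0)], ["requests", "transactions"]))
scored = model.transform(latest)  # adds a "prediction" column

anomalies = scored.where(
    sql_abs(col("transactions") - col("prediction")) > 0.2 * col("prediction"))
anomalies.show()
```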
Getting Ready to Use Redis with Apache Spark, with Dvir Volk (Spark Summit)
Getting Ready to Use Redis with Apache Spark is a technical tutorial designed to address integrating Redis with an Apache Spark deployment to increase the performance of serving complex decision models. To set the context for the session, we start with a quick introduction to Redis and the capabilities Redis provides. We cover the basic data types provided by Redis and cover the module system. Using an ad-serving use case, we look at how Redis can improve the performance and reduce the cost of using complex ML models in production. Attendees will be guided through the key steps of setting up and integrating Redis with Spark, including how to train a model using Spark, then load and serve it using Redis, as well as how to work with the Spark-Redis module. The capabilities of the Redis Machine Learning Module (redis-ml) will be discussed, focusing primarily on decision trees and regression (linear and logistic), with code examples to demonstrate how to use these features. At the end of the session, developers should feel confident building a prototype/proof-of-concept application using Redis and Spark. Attendees will understand how Redis complements Spark and how to use Redis to serve complex ML models with high performance.
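As a hedged sketch of the train-in-Spark, serve-from-Redis pattern (plain redis-py with an invented key; the redis-ml module's own commands are deliberately not shown here):

```python
# Hedged sketch: train a model with Spark ML, publish its parameters to Redis
# with plain redis-py, and score a request from Redis without touching Spark.
# The key name, features, and data are invented for illustration.
import json
import redis
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("train-and-publish").getOrCreate()

df = spark.createDataFrame(
    [(10.0, 100.0, 1.0), (20.0, 150.0, 3.0), (40.0, 400.0, 5.0)],
    ["clicks", "impressions", "conversions"])
features = VectorAssembler(
    inputCols=["clicks", "impressions"], outputCol="features").transform(df)
model = LinearRegression(labelCol="conversions").fit(features)

r = redis.Redis(host="localhost", port=6379)
r.set("model:conversion", json.dumps({
    "coefficients": model.coefficients.toArray().tolist(),
    "intercept": model.intercept,
}))

# Serving side: a dot product against the stored parameters.
params = json.loads(r.get("model:conversion"))
x = [15.0, 120.0]  # clicks, impressions
score = sum(c * xi for c, xi in zip(params["coefficients"], x)) + params["intercept"]
print(score)
```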
Deduplication and Author-Disambiguation of Streaming Records via Supervised M... (Spark Summit)
Here we present a general supervised framework for record deduplication and author-disambiguation via Spark. This work differentiates itself in three ways:
– The use of Databricks and AWS makes this a scalable implementation. Compute resources are comparably lower than traditional legacy technology using big boxes 24/7. Scalability is crucial, as Elsevier's Scopus data, the biggest scientific abstract repository, covers roughly 250 million authorships from 70 million abstracts spanning a few hundred years.
– We create a fingerprint for each piece of content using deep learning and/or word2vec algorithms to expedite pairwise similarity calculation. These encoders substantially reduce compute time while maintaining semantic similarity (unlike traditional TF-IDF or predefined taxonomies). We will briefly discuss how to optimize word2vec training with high parallelization. Moreover, we show how these encoders can be used to derive a standard representation for all our entities, such as documents, authors, users, journals, etc. This standard representation can simplify the recommendation problem into a pairwise similarity search, and hence it can offer a basic recommender for cross-product applications where we may not have a dedicated recommender engine designed.
– Traditional author-disambiguation or record deduplication algorithms are batch-processing, with little to no training data. However, we have roughly 25 million authorships that are manually curated or corrected upon user feedback. Hence, it is crucial to maintain historical profiles, and we have therefore developed a machine learning implementation that deals with data streams and processes them in mini-batches or one document at a time. We will discuss how to measure the accuracy of such a system, how to tune it, and how to process the raw pairwise similarities into final clusters. Lessons learned from this talk can help all sorts of companies that want to integrate their data or deduplicate their user/customer/product databases.
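A small, hedged sketch of the fingerprint idea (Spark ML's Word2Vec averages a document's word vectors into a single vector; the data and the 0.9 similarity threshold are invented):

```python
# Minimal sketch of document "fingerprints": embed each record's text with
# Spark ML Word2Vec, then compare candidate pairs by cosine similarity.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.feature import Word2Vec

spark = SparkSession.builder.appName("dedup-fingerprints").getOrCreate()

docs = spark.createDataFrame([
    ("r1", "deep learning for record deduplication".split()),
    ("r2", "record deduplication with deep learning".split()),
    ("r3", "weather impact on traffic prediction".split()),
], ["id", "words"])

w2v = Word2Vec(vectorSize=16, minCount=1, inputCol="words", outputCol="vec")
rows = w2v.fit(docs).transform(docs).select("id", "vec").collect()

def cosine(u, v):
    u, v = np.asarray(u), np.asarray(v)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Pairs above the (assumed) threshold become deduplication candidates.
for i in range(len(rows)):
    for j in range(i + 1, len(rows)):
        sim = cosine(rows[i].vec.toArray(), rows[j].vec.toArray())
        if sim > 0.9:
            print(rows[i].id, rows[j].id, round(sim, 3))
```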
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization... (Spark Summit)
The use of large-scale machine learning and data mining methods is becoming ubiquitous in many application domains, ranging from business intelligence and bioinformatics to self-driving cars. These methods heavily rely on matrix computations, and it is hence critical to make these computations scalable and efficient. These matrix computations are often complex and involve multiple steps that need to be optimized and sequenced properly for efficient execution. This work presents new efficient and scalable matrix processing and optimization techniques based on Spark. The proposed techniques estimate the sparsity of intermediate matrix-computation results and optimize communication costs. An evaluation plan generator for complex matrix computations is introduced, as well as a distributed plan optimizer that exploits dynamic cost-based analysis and rule-based heuristics. The result of a matrix operation will often serve as an input to another matrix operation, thus defining the matrix data dependencies within a matrix program. The matrix query plan generator produces query execution plans that minimize memory usage and communication overhead by partitioning the matrix based on the data dependencies in the execution plan. We implemented the proposed matrix techniques inside Spark SQL, and optimize the matrix execution plan based on the Spark SQL Catalyst. We conduct case studies on a series of ML models and matrix computations with special features on different datasets: PageRank, GNMF, BFGS, sparse matrix chain multiplications, and a biological data analysis. The open-source library ScaLAPACK and the array-based database SciDB are used for performance evaluation. Our experiments are performed on six real-world datasets: social network data (e.g., soc-pokec, cit-Patents, LiveJournal), Twitter2010, Netflix recommendation data, and a 1000 Genomes Project sample. Experiments demonstrate that our proposed techniques achieve up to an order-of-magnitude performance improvement.
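MatFast's actual planner is not reproduced here, but the classic matrix-chain-ordering dynamic program below illustrates why an execution-plan generator matters: the parenthesization of a chain changes its cost by orders of magnitude.

```python
# Classic O(n^3) dynamic program over a matrix chain whose i-th matrix is
# dims[i] x dims[i+1]; returns the minimal number of scalar multiplications.
# An illustration of plan optimization in general, not MatFast's algorithm.
def matrix_chain_cost(dims):
    n = len(dims) - 1                      # number of matrices in the chain
    cost = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j))
    return cost[0][n - 1]

# (10x1000) * (1000x5) * (5x1000): the best order costs 100,000 scalar
# multiplications, while the worst order costs 15,000,000.
print(matrix_chain_cost([10, 1000, 5, 1000]))  # -> 100000
```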
Adjusting primitives for graph: SHORT REPORT / NOTES, by Subhajit Sahu
Notes on primitives for graph algorithms such as PageRank. Compressed Sparse Row (CSR) is an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues (a minimal sketch follows this list).
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
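A minimal sketch of such an automated data-quality gate (pandas, with toy data and invented rules):

```python
# Toy data-quality gate: rows failing any rule are quarantined before they
# reach downstream tables. Column names and rules are invented.
import pandas as pd

df = pd.DataFrame({
    "order_id": [1, 2, 2, 4],
    "amount":   [9.99, -5.00, 12.50, None],
})

rules = {
    "non_null_amount": df["amount"].notna(),
    "positive_amount": df["amount"] > 0,
    "unique_order_id": ~df["order_id"].duplicated(keep=False),
}

passed = pd.concat(rules, axis=1).all(axis=1)
quarantined, clean = df[~passed], df[passed]
print(f"quarantined {len(quarantined)} of {len(df)} rows")
```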
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Opendatabay - Open Data Marketplace.pptx, by Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Techniques to optimize the PageRank algorithm usually fall into two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, i.e. those with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance; final ranks of chain nodes can be easily calculated. This could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order. This could help reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in PageRank computation. The combination of all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
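As an illustrative sketch of the first optimization above (plain Python on an invented toy graph; not the STICD implementation), freezing vertices whose rank has stopped moving shrinks the work done by later iterations:

```python
# PageRank with converged-vertex skipping: once a vertex's rank change falls
# below tol, it is frozen and later iterations skip recomputing it. A toy
# illustration; stricter variants re-check frozen vertices when their
# in-neighbours change. The graph is assumed dangling-free.
def pagerank_skip_converged(adj, alpha=0.85, tol=1e-6, max_iter=100):
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    incoming = {v: [] for v in adj}          # invert adjacency once (CSR-style)
    for u, outs in adj.items():
        for v in outs:
            incoming[v].append(u)
    converged = set()
    for _ in range(max_iter):
        new_rank = {}
        for v in adj:
            if v in converged:               # skip: saves per-iteration work
                new_rank[v] = rank[v]
                continue
            s = sum(rank[u] / len(adj[u]) for u in incoming[v])
            new_rank[v] = (1 - alpha) / n + alpha * s
            if abs(new_rank[v] - rank[v]) < tol:
                converged.add(v)
        rank = new_rank
        if len(converged) == n:
            break
    return rank

print(pagerank_skip_converged({"a": ["b"], "b": ["c"], "c": ["a", "b"]}))
```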
As Europe's leading economic powerhouse and the fourth-largest economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like Russia and China, Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to Advanced Persistent Threats (APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.