In this demo based talk with live coding, we’ll present a functional typeful framework for developing Apache Spark applications. We’ll walk through the following key topics: – turning unmanageable Spark scripts into typeful Spark Functions – serverless deployment of Spark functions into the cloud – unit testing Spark functions to save cluster resources and developers time – seamless Spark session management between concurrent Spark jobs in exclusive or share modes
Description of the API concept for engineering and how it can be useful. Particularly how it should be used with respect to genomics data. Finally, an analogy of the API concept in synthetic biology and how evolution allows encapsulation.
Apache Spark is already a proven leader as a heavy-weight distributed processing framework. But how does one migrate an 8-years-old, single-server, MySQL-based legacy system to such new shiny frameworks? How do you accurately preserve the behavior of a system consuming Terabytes of data every day, hiding numerous undocumented implicit gotchas and changing constantly, while shifting to brand new development paradigms? In this talk we present Kenshoo’s attempt at this challenge, where we migrated a legacy aggregation system to Spark. Our solutions include heavy usage of metrics and graphite for analyzing production data; “local-mode” client enabling reuse of legacy tests suits; data validations using side-by-side execution; and maximum reuse of code through refactoring and composition. Some of these solutions use Spark-specific characteristics and features.
Crossing the Bridge: Connecting Rails and your Front-end FrameworkDaniel Spector
Presented at Railsconf 2015 by Daniel Spector, @danielspecs.
Crossing the Bridge explores tools, patterns and best practices to connect your Javascript MVC framework to Rails in the most seamless way possible. The talk progresses from demonstrating the standard API request cycle to preloading data to your client-side framework to rendering your javascript on the server. It explores Isomorphic Javascript and ways of implementing it with Rails.
Spark is quickly becoming the most popular framework in the MapReduce family. With better performance and much better APIs - it's easier than ever to perform the actual data wrangling; But as always - the challenges of operating, verifying and optimizing your application over time are much greater than the initial setup - and all the more so with distributes systems. In Kenshoo, we've used and developed some tools and techniques to monitor the state of our Spark application: health, correctness, performance, utilization, and business KPIs. We'll discuss some standard tools and less standard techniques to get the most information out of your Spark cluster.
Description of the API concept for engineering and how it can be useful. Particularly how it should be used with respect to genomics data. Finally, an analogy of the API concept in synthetic biology and how evolution allows encapsulation.
Apache Spark is already a proven leader as a heavy-weight distributed processing framework. But how does one migrate an 8-years-old, single-server, MySQL-based legacy system to such new shiny frameworks? How do you accurately preserve the behavior of a system consuming Terabytes of data every day, hiding numerous undocumented implicit gotchas and changing constantly, while shifting to brand new development paradigms? In this talk we present Kenshoo’s attempt at this challenge, where we migrated a legacy aggregation system to Spark. Our solutions include heavy usage of metrics and graphite for analyzing production data; “local-mode” client enabling reuse of legacy tests suits; data validations using side-by-side execution; and maximum reuse of code through refactoring and composition. Some of these solutions use Spark-specific characteristics and features.
Crossing the Bridge: Connecting Rails and your Front-end FrameworkDaniel Spector
Presented at Railsconf 2015 by Daniel Spector, @danielspecs.
Crossing the Bridge explores tools, patterns and best practices to connect your Javascript MVC framework to Rails in the most seamless way possible. The talk progresses from demonstrating the standard API request cycle to preloading data to your client-side framework to rendering your javascript on the server. It explores Isomorphic Javascript and ways of implementing it with Rails.
Spark is quickly becoming the most popular framework in the MapReduce family. With better performance and much better APIs - it's easier than ever to perform the actual data wrangling; But as always - the challenges of operating, verifying and optimizing your application over time are much greater than the initial setup - and all the more so with distributes systems. In Kenshoo, we've used and developed some tools and techniques to monitor the state of our Spark application: health, correctness, performance, utilization, and business KPIs. We'll discuss some standard tools and less standard techniques to get the most information out of your Spark cluster.
apidays LIVE Australia 2020 - Building distributed systems on the shoulders o...apidays
apidays LIVE Australia 2020 - Building Business Ecosystems
Building distributed systems on the shoulders of giants
Dasith Wijesiriwardena, Telstra Purple (Readify)
Swift was originally released in 2014, and Open Sourced by Apple in late 2015. The Open Source release generated an explosion of community interest and support, resulting in ports to other platforms and significant language changes. Swift version 3, which reflects the results of much of this work, was released in September of 2016, bringing with it some significant refinements to the core language and a new package manager.
Swift is a multi-paradigm language, supporting imperative, functional, and object-oriented programming styles. The language is strongly typed but has extensive support for type inference and substantial tooling available in XCode to identify and in some cases automatically fix common programming errors. Swift uses a memory management strategy called automatic reference counting (ARC), freeing programmers from the tedium of manually managing memory allocation. This combination of strong typing, maximal type inference, automatic reference counting (ARC), and excellent tooling results in an experience that can be described as “the Macintosh of programming languages”.
This talk will present some of the history of the development of Swift with emphasis on how the Open Source release of the language kick-started activity, review the basic syntax of Swift (with comparisons to similar languages that attendees may be more familiar with), and describe what tools are available to help learn the language, including XCode, the Swift REPL available from XCode, and the new Swift Playgrounds for iPad that debuted with Swift 3 and iOS10. After attending this talk, an attendee with no previous Swift experience will understand exactly why they should be excited about this relatively new programming language and be up to date on exactly what they need to do to dive into Swift coding for themselves.
I strongly believe in the combination of Apache Spark with Java. In this tutorial, prepared for NCDevCon, we are going through the basics of Spark as well as 2 examples: a basic ingestion and an analytics example based on joins & group by. Follow me @jgperrin.
Oracle Application Express (APEX) is shipped with several JavaScript libraries, jQuery being the best known one of them. And on top of these libraries the APEX Development Team created their own. You probably used a couple of these API's already, like $s, $v etc.
But there are way more and some of them are extremely useful. But first you have to be aware they exists. And secondly you have to know how to use the properly.
This session will cover the most valuable JavaScript API's with some real world examples.
Most developers stick to the standard $s and $v functions - even without knowing there is also a $v2 and $s can have more parameters.
The focus will be on the namespaced apex API's, like apex.server.process and apex.event.trigger.
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging as well as a look to future for data property type accumulators which may be coming to Spark in future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Apache Hive Hook
I couldn't find enough info about Hive hooks.
So, I made this.
I hope this presentation will be useful when you want to use hooks.
This included some infomation about metastore event listeners.
This was written based on release-0.11 tag.
InterConnect: Java, Node.js and Swift - Which, Why and WhenChris Bailey
Java, Node.js, and Swift are three of the most popular and effective programming languages in use today. When presented with an opportunity to choose, it may not be clear which language is best suited for the job. This session will provide a tour of these languages and the use cases for which each is best suited.
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Sematext Group, Inc.
In this talk from Lucene/Solr Revolution 2015, Solr and centralized logging experts Radu Gheorghe and Rafal Kuć cover topics like: flow in Logstash, flow in rsyslog, parsing JSON, log shipping, Solr tuning, time-based collections and tiered clusters.
or "Towards a Standard TAPI", presented at AUSOUG Connect Perth, November 2016. I've been using a combination of Table APIs and Transaction APIs to build complex but maintainable applications in Apex - something I encourage everyone to at least consider.
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharpergranicz
Reactive programming has become an indispensable tool for solving many of the challenges in data-driven applications, desktop and web alike. A quick survey of the reactive landscape reveals a staggering variety of choices available to developers, and it is a smaller challenge nowadays to choose the right technology to build on. In this talk, you will learn about applying functional programming and F# (and C#) to develop full-stack, reactive web applications and microservices using WebSharper, a mature open source web framework/ecosystem for developing enterprise-grade .NET web applications.
Hooray, open source Swift finally arrived on Linux in December. Let’s see how easy it is to use Swift for your backend and why Swift is a good choice for safe and fast development.
My talk at Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
There is an option to export the model into PMML and then import it into a separated scoring engine. The idea of interoperability is great but it has multiple challenges, such as code duplication, limited extensibility, inconsistency, extra moving parts. In this talk we discussed an alternative solution that does not introduce custom model formats and new standards, not based on export/import workflow and shares Apache Spark API.
apidays LIVE Australia 2020 - Building distributed systems on the shoulders o...apidays
apidays LIVE Australia 2020 - Building Business Ecosystems
Building distributed systems on the shoulders of giants
Dasith Wijesiriwardena, Telstra Purple (Readify)
Swift was originally released in 2014, and Open Sourced by Apple in late 2015. The Open Source release generated an explosion of community interest and support, resulting in ports to other platforms and significant language changes. Swift version 3, which reflects the results of much of this work, was released in September of 2016, bringing with it some significant refinements to the core language and a new package manager.
Swift is a multi-paradigm language, supporting imperative, functional, and object-oriented programming styles. The language is strongly typed but has extensive support for type inference and substantial tooling available in XCode to identify and in some cases automatically fix common programming errors. Swift uses a memory management strategy called automatic reference counting (ARC), freeing programmers from the tedium of manually managing memory allocation. This combination of strong typing, maximal type inference, automatic reference counting (ARC), and excellent tooling results in an experience that can be described as “the Macintosh of programming languages”.
This talk will present some of the history of the development of Swift with emphasis on how the Open Source release of the language kick-started activity, review the basic syntax of Swift (with comparisons to similar languages that attendees may be more familiar with), and describe what tools are available to help learn the language, including XCode, the Swift REPL available from XCode, and the new Swift Playgrounds for iPad that debuted with Swift 3 and iOS10. After attending this talk, an attendee with no previous Swift experience will understand exactly why they should be excited about this relatively new programming language and be up to date on exactly what they need to do to dive into Swift coding for themselves.
I strongly believe in the combination of Apache Spark with Java. In this tutorial, prepared for NCDevCon, we are going through the basics of Spark as well as 2 examples: a basic ingestion and an analytics example based on joins & group by. Follow me @jgperrin.
Oracle Application Express (APEX) is shipped with several JavaScript libraries, jQuery being the best known one of them. And on top of these libraries the APEX Development Team created their own. You probably used a couple of these API's already, like $s, $v etc.
But there are way more and some of them are extremely useful. But first you have to be aware they exists. And secondly you have to know how to use the properly.
This session will cover the most valuable JavaScript API's with some real world examples.
Most developers stick to the standard $s and $v functions - even without knowing there is also a $v2 and $s can have more parameters.
The focus will be on the namespaced apex API's, like apex.server.process and apex.event.trigger.
Debugging PySpark: Spark Summit East talk by Holden KarauSpark Summit
Apache Spark is one of the most popular big data projects, offering greatly improved performance over traditional MapReduce models. Much of Apache Spark’s power comes from lazy evaluation along with intelligent pipelining, which can make debugging more challenging. This talk will examine how to debug Apache Spark applications, the different options for logging in Spark’s variety of supported languages, as well as some common errors and how to detect them.
Spark’s own internal logging can often be quite verbose, and this talk will examine how to effectively search logs from Apache Spark to spot common problems. In addition to the internal logging, this talk will look at options for logging from within our program itself.
Spark’s accumulators have gotten a bad rap because of how they interact in the event of cache misses or partial recomputes, but this talk will look at how to effectively use Spark’s current accumulators for debugging as well as a look to future for data property type accumulators which may be coming to Spark in future version.
In addition to reading logs, and instrumenting our program with accumulators, Spark’s UI can be of great help for quickly detecting certain types of problems.
Apache Hive Hook
I couldn't find enough info about Hive hooks.
So, I made this.
I hope this presentation will be useful when you want to use hooks.
This included some infomation about metastore event listeners.
This was written based on release-0.11 tag.
InterConnect: Java, Node.js and Swift - Which, Why and WhenChris Bailey
Java, Node.js, and Swift are three of the most popular and effective programming languages in use today. When presented with an opportunity to choose, it may not be clear which language is best suited for the job. This session will provide a tour of these languages and the use cases for which each is best suited.
Large Scale Log Analytics with Solr (from Lucene Revolution 2015)Sematext Group, Inc.
In this talk from Lucene/Solr Revolution 2015, Solr and centralized logging experts Radu Gheorghe and Rafal Kuć cover topics like: flow in Logstash, flow in rsyslog, parsing JSON, log shipping, Solr tuning, time-based collections and tiered clusters.
or "Towards a Standard TAPI", presented at AUSOUG Connect Perth, November 2016. I've been using a combination of Table APIs and Transaction APIs to build complex but maintainable applications in Apex - something I encourage everyone to at least consider.
GDG Almaty Meetup: Reactive full-stack .NET web applications with WebSharpergranicz
Reactive programming has become an indispensable tool for solving many of the challenges in data-driven applications, desktop and web alike. A quick survey of the reactive landscape reveals a staggering variety of choices available to developers, and it is a smaller challenge nowadays to choose the right technology to build on. In this talk, you will learn about applying functional programming and F# (and C#) to develop full-stack, reactive web applications and microservices using WebSharper, a mature open source web framework/ecosystem for developing enterprise-grade .NET web applications.
Hooray, open source Swift finally arrived on Linux in December. Let’s see how easy it is to use Swift for your backend and why Swift is a good choice for safe and fast development.
My talk at Data Science Labs conference in Odessa.
Training a model in Apache Spark while having it automatically available for real-time serving is an essential feature for end-to-end solutions.
There is an option to export the model into PMML and then import it into a separated scoring engine. The idea of interoperability is great but it has multiple challenges, such as code duplication, limited extensibility, inconsistency, extra moving parts. In this talk we discussed an alternative solution that does not introduce custom model formats and new standards, not based on export/import workflow and shares Apache Spark API.
A lecture on Apace Spark, the well-known open source cluster computing framework. The course consisted of three parts: a) install the environment through Docker, b) introduction to Spark as well as advanced features, and c) hands-on training on three (out of five) of its APIs, namely Core, SQL \ Dataframes, and MLlib.
Author: Stefan Papp, Data Architect at “The unbelievable Machine Company“. An overview of Big Data Processing engines with a focus on Apache Spark and Apache Flink, given at a Vienna Data Science Group meeting on 26 January 2017. Following questions are addressed:
• What are big data processing paradigms and how do Spark 1.x/Spark 2.x and Apache Flink solve them?
• When to use batch and when stream processing?
• What is a Lambda-Architecture and a Kappa Architecture?
• What are the best practices for your project?
Burn down the silos! Helping dev and ops gel on high availability websitesLindsay Holmwood
HA websites are where the rubber meets the road - at 200km/h. Traditional separation of dev and ops just doesn't cut it.
Everything is related to everything. Code relies on performant and resilient infrastructure, but highly performant infrastructure will only get a poorly written application so far. Worse still, root cause analysis in HA sites will more often than not identify problems that don't clearly belong to either devs or ops.
The two options are collaborate or die.
This talk will introduce 3 core principles for improving collaboration between operations and development teams: consistency, repeatability, and visibility. These principles will be investigated with real world case studies and associated technologies audience members can start using now. In particular, there will be a focus on:
- fast provisioning of test environments with configuration management
- reliable and repeatable automated deployments
- application and infrastructure visibility with statistics collection, logging, and visualisation
Michal Malohlava talks about the PySparkling Water package for Spark and Python users.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
High concurrency, Low latency analytics using Spark/KuduChris George
With the right combination of open source projects, you can have a high concurrency and low latency spark jobs for doing data analysis. We'll show both REST and JDBC access to access data from a persistent spark context and then show how the combination of Spark Job Server, Spark Thrift Server and Apache Kudu can create a scalable backend for low latency analytics.
Agile Data Science 2.0 (O'Reilly 2017) defines a methodology and a software stack with which to apply the methods. *The methodology* seeks to deliver data products in short sprints by going meta and putting the focus on the applied research process itself. *The stack* is but an example of one meeting the requirements that it be utterly scalable and utterly efficient in use by application developers as well as data engineers. It includes everything needed to build a full-blown predictive system: Apache Spark, Apache Kafka, Apache Incubating Airflow, MongoDB, ElasticSearch, Apache Parquet, Python/Flask, JQuery. This talk will cover the full lifecycle of large data application development and will show how to use lessons from agile software engineering to apply data science using this full-stack to build better analytics applications. The entire lifecycle of big data application development is discussed. The system starts with plumbing, moving on to data tables, charts and search, through interactive reports, and building towards predictions in both batch and realtime (and defining the role for both), the deployment of predictive systems and how to iteratively improve predictions that prove valuable.
In this one day workshop, we will introduce Spark at a high level context. Spark is fundamentally different than writing MapReduce jobs so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Infrastructure-as-code: bridging the gap between Devs and OpsMykyta Protsenko
Ops are overwhelmed with support. Devs are mad because their cannot deploy the changes as fast as they want. Sounds familiar?
Infrastructure-as-code can make your life easier by empowering developers and reducing operations' routine toil. It can cut down the lead time for infrastructure provisioning from hours or even days to minutes.
This talk reviews several IaC tools and approaches, showing how to integrate them into continuous delivery pipeline. It covers the problems and challenges that engineers may face while working with infrastructure-as-code tools and provides a few hands-on recipes to address them.
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
ETL is the first phase when building a big data processing platform. Data is available from various sources and formats, and transforming the data into a compact binary format (Parquet, ORC, etc.) allows Apache Spark to process it in the most efficient manner. This talk will discuss common issues and best practices for speeding up your ETL workflows, handling dirty data, and debugging tips for identifying errors.
Speakers: Kyle Pistor & Miklos Christine
This talk was originally presented at Spark Summit East 2017.
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
This presentation introduces Apache Flink, a massively parallel data processing engine which currently undergoes the incubation process at the Apache Software Foundation. Flink's programming primitives are presented and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it an unique system in the world of Big Data processing.
Similar to Data Summer Conf 2018, “Mist – Serverless proxy for Apache Spark (RUS)” — Vadim Chelyshov, Software engineer at Provectus (20)
Looking to make your document processing operations more effective and cost-efficient with AI/ML? Learn from the experts of Provectus and Amazon Web Services (AWS) how to choose the right solution for your company! We will look into the management and engineering perspectives of AI document processing, from industry use cases and the solution map to our unique methodology for assessing available document processing solutions to Provectus IDP. Whether you are looking for a ready-made solution or you plan to build a custom solution of your own, this webinar will help you find the best option for your business.
Agenda
- Introductions
- Industry use cases
- Intelligent Document Processing (IDP) overview
- IDP Solutions map
- AWS IDP Solution
- Provectus IDP Platform
- Q&A
Intended Audience
Technology executives and decision makers, including such roles as CIO, CCO, COO, and CDO; digital transformation managers; data and ML engineers.
Presenters
Almir Davletov, IDP Subject Matter Expert, Provectus
Yaroslav Tarasyuk, Business Development, Provectus
Sonali Sahu, Sr. Solutions Architect, AWS
Interested? Learn more about Provectus Intelligent Document Processing Solution: https://provectus.com/document-processing-solution/
Intelligent Document Processing in Healthcare. Choosing the Right Solutions.Provectus
Healthcare organizations generate piles of documents and forms in different formats, making it difficult to achieve operational excellence and streamline business processes. Manual entry and OCR are no longer viable, and healthcare entities are looking for new solutions to handle documents.
In this presentation you can learn about:
- Healthcare document types and use cases
- IDP framework: building blocks for document processing solutions
- The document processing market landscape
- Methodology for solution evaluation: comparing apples to apples
Whether you are looking for a ready-made solution or plan to build a custom solution of your own, this webinar will help you find the best fit for your healthcare use cases.
Choosing the Right Document Processing Solution for Healthcare OrganizationsProvectus
Looking to automate document processing in your healthcare organization? Learn from Provectus & AWS experts how to make data capture, conversion, and analytics more efficient. Process and manage documents faster and on a larger scale with AI & Machine Learning.
In this presentation, we offer management and engineering perspectives on document processing with AI, to help you explore available options. Whether you are looking for a ready-made solution or plan to build a custom solution of your own, this webinar will help you find the best fit for your healthcare use cases.
MLOps and Data Quality: Deploying Reliable ML Models in ProductionProvectus
Looking to build a robust machine learning infrastructure to streamline MLOps? Learn from Provectus experts how to ensure the success of your MLOps initiative by implementing Data QA components in your ML infrastructure.
For most organizations, the development of multiple machine learning models, their deployment and maintenance in production are relatively new tasks. Join Provectus as we explain how to build an end-to-end infrastructure for machine learning, with a focus on data quality and metadata management, to standardize and streamline machine learning life cycle management (MLOps).
Agenda
- Data Quality and why it matters
- Challenges and solutions of Data Testing
- Challenges and solutions of Model Testing
- MLOps pipelines and why they matter
- How to expand validation pipelines for Data Quality
AI Stack on AWS: Amazon SageMaker and BeyondProvectus
Looking to learn more about AWS AI stack? Join experts from Provectus & AWS to find out how to use Amazon SageMaker (with combination with other tools and services) to enable enterprise-wide AI.
Companies are looking to scale and become more productive when it comes to AI and data initiatives. They seek to launch AI projects more rapidly, which, among many other factors, requires a robust machine learning infrastructure. In this webinar, you will learn how to create a canonical SageMaker workflow, expand the SageMaker workflow to a holistic implementation, enhance and expand the implementation using best practices for feature store, data versioning, ML pipeline orchestration, and model monitoring.
Agenda
- Introductions
- Amazon SageMaker Overview
- Real-World Use Case
- Data Lake for Machine Learning
- Amazon SageMaker Experiments
- Orchestration Beyond SageMaker Experiments
- Amazon SageMaker Debugger
- Amazon SageMaker Model Monitor
- Webinar Takeaways
Intended audience
Technology executives & decision makers, manager-level tech roles, data engineers & data scientists, ML practitioners & ML engineers, and developers
Presenters
- Stepan Pushkarev, Chief Technology Officer, Provectus
- Pritpal Sahota, Technical Account Manager, Provectus
- Christopher A. Burns, Sr. AI/ML Solution Architect, AWS
Feel free to share this presentation with your colleagues and don't hesitate to reach out to us at info@provectus.com if you have any questions!
REQUEST WEBINAR: https://provectus.com/ai-stack-on-aws-sagemaker-and-beyond-mar-2020/
Feature Store as a Data Foundation for Machine LearningProvectus
Looking to design and build a centralized, scalable Feature Store for your Data Science & Machine Learning teams to take advantage of? Come and learn from experts of Provectus and Amazon Web Services (AWS) how to!
Feature Store is a key component of the ML stack and data infrastructure, which enables feature engineering and management. By having a Feature Store, organizations can save massive amounts of resources, innovate faster, and drive ML processes at scale. In this webinar, you will learn how to build a Feature Store with a data mesh pattern and see how to achieve consistency between real-time and training features, to improve reproducibility with time-traveling for data.
Agenda
- Modern Data Lakes & Modern ML Infrastructure
- Existing and Emerging Architectural Shifts
- Feature Store: Overview and Reference Architecture
- AWS Perspective on Feature Store
Intended Audience
Technology executives & decision makers, manager-level tech roles, data architects & analysts, data engineers & data scientists, ML practitioners & ML engineers, and developers
Presenters
- Stepan Pushkarev, Chief Technology Officer, Provectus
- Gandhi Raketla, Senior Solutions Architect, AWS
- German Osin, Senior Solutions Architect, Provectus
Feel free to share this presentation with your colleagues and don't hesitate to reach out to us at info@provectus.com if you have any questions!
REQUEST WEBINAR: https://provectus.com/webinar-feature-store-as-data-foundation-for-ml-nov-2020/
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerProvectus
Looking to implement MLOps using AWS services and Kubeflow? Come and learn about machine learning from the experts of Provectus and Amazon Web Services (AWS)!
Businesses recognize that machine learning projects are important but go beyond just building and deploying models, which is mostly done by organizations. Successful ML projects entail a complete lifecycle involving ML, DevOps, and data engineering and are built on top of ML infrastructure.
AWS and Amazon SageMaker provide a foundation for building infrastructure for machine learning while Kubeflow is a great open source project, which is not given enough credit in the AWS community. In this webinar, we show how to design and build an end-to-end ML infrastructure on AWS.
Agenda
- Introductions
- Case Study: GoCheck Kids
- Overview of AWS Infrastructure for Machine Learning
- Provectus ML Infrastructure on AWS
- Experimentation
- MLOps
- Feature Store
Intended Audience
Technology executives & decision makers, manager-level tech roles, data engineers & data scientists, ML practitioners & ML engineers, and developers
Presenters
- Stepan Pushkarev, Chief Technology Officer, Provectus
- Qingwei Li, ML Specialist Solutions Architect, AWS
Feel free to share this presentation with your colleagues and don't hesitate to reach out to us at info@provectus.com if you have any questions!
REQUEST WEBINAR: https://provectus.com/webinar-mlops-and-reproducible-ml-on-aws-with-kubeflow-and-sagemaker-aug-2020/
Cost Optimization for Apache Hadoop/Spark Workloads with Amazon EMRProvectus
Considering new ways and options for reducing operational costs and scaling flexibility of your Apache Hadoop/Spark? Try migrating to Amazon EMR!
On-premises Apache Hadoop/Spark clusters are among the top sources of financial pressure for businesses. IT organizations want to reduce spend while still meeting demand, to keep their legacy data applications up and running. Come and learn from experts at Provectus & AWS how you can use Amazon EMR to start driving cost efficiencies in your organization!
Agenda
- Hadoop market and cost optimizations using Amazon EMR
- Cost related and other challenges of on-prem Hadoop clusters
- Cost optimizations by using Amazon EMR and migration best practices
Intended audience
Technology executives & decision makers, manager-level tech roles, data engineers & data scientists, and developers
Presenters
- Stepan Pushkarev, Chief Technology Officer, Provectus
- Pritpal Sahota, Technical Account Manager, Provectus
- Nirav Shah, Senior Solutions Architect, AWS
- Perry Peterson, Business Development Manager, AWS
Feel free to share this presentation with your colleagues and don't hesitate to reach out to us at info@provectus.com if you have any questions!
REQUEST WEBINAR: https://provectus.com/cost-optimization-for-apache-hadoop-spark-workloads-with-amazon-emr-june-2020/
ODSC webinar "Kubeflow, MLFlow and Beyond — augmenting ML delivery" Stepan Pu...Provectus
What's a machine learning workflow? What open source tools can you use to automate ML workflow?
Reproducible ML pipelines in research and production with monitoring insights from live inference clusters could enable and accelerate the delivery of AI solutions for enterprises. There is a growing ecosystem of tools that augment researchers and machine learning engineers in their day to day operations.
Still, there are big gaps in the machine learning workflow when it comes to training dataset versioning, training performance and metadata tracking, integration testing, inferencing quality monitoring, bias detection, concept drift detection and other aspects that prevent the adoption of AI in organizations of all sizes.
"Building a Modern Data platform in the Cloud", Alex Casalboni, AWS Dev Day K...Provectus
AWS Dev Day Kyiv 2019
Track: Analytics & Machine Learning
Session: "Building a Modern Data platform in the Cloud"
Speaker: Alex Casalboni, AWS Technical Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/HIDnAG9AxZo
"How to build a global serverless service", Alex Casalboni, AWS Dev Day Kyiv ...Provectus
AWS Dev Day Kyiv 2019
Track: Modern Application Development
Session: "How to build a global serverless service"
Speaker: Alex Casalboni, AWS Technical Evangelist
Level: 400
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/Q19B-NTkMfk
"Automating AWS Infrastructure with PowerShell", Martin Beeby, AWS Dev Day Ky...Provectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: "Automating AWS Infrastructure with PowerShell"
Speaker: Martin Beeby, AWS Principle Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/rgIjjK2J4dQ
"Analyzing your web and application logs", Javier Ramirez, AWS Dev Day Kyiv 2...Provectus
AWS Dev Day Kyiv 2019
Track: Analytics & Machine Learning
Session: "Analyzing your web and application logs"
Speaker: Javier Ramirez, AWS Technical Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/IpEhEs1sXeg
"Resiliency and Availability Design Patterns for the Cloud", Sebastien Storma...Provectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: "Resiliency and Availability Design Patterns for the Cloud"
Speaker: Sebastien Stormacq, AWS Technical Evangelist
Level: 400
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/O8gonQCJawU
"Architecting SaaS solutions on AWS", Oleksandr Mykhalchuk, AWS Dev Day Kyiv ...Provectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: ""Architecting SaaS solutions on AWS""
Speaker: Oleksandr Mykhalchuk, Director of DevOps & Cloud Services at Softserve
Level: 300
Video: https://youtu.be/3lKoe-ts8Qs
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
"Developing with .NET Core on AWS", Martin Beeby, AWS Dev Day Kyiv 2019Provectus
AWS Dev Day Kyiv 2019
Track: Modern Application Development
Session: "Developing with .NET Core on AWS"
Speaker: Martin Beeby, AWS Principle Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/OzM8L7H1LmA
"How to build real-time backends", Martin Beeby, AWS Dev Day Kyiv 2019Provectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: "How to build real-time backends"
Speaker: Martin Beeby, AWS Principle Evangelist
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://youtu.be/bsZYA6V3bDA
"Integrate your front end apps with serverless backend in the cloud", Sebasti...Provectus
AWS Dev Day Kyiv 2019
Track: Modern Application Development
Session: "Integrate your front end apps with serverless backend in the cloud"
Speaker: Sebastien Stormacq, AWS Technical Evangelist
Level: 200
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://www.youtube.com/watch?v=6z43H11qoU8&t=1s
"Scaling ML from 0 to millions of users", Julien Simon, AWS Dev Day Kyiv 2019Provectus
AWS Dev Day Kyiv 2019
Track: Analytics & Machine Learning
Session: ""Scaling ML from 0 to millions of users""
Speaker: Julien Simon, Global AI & Machine Learning Evangelist at AWS
Level: 300
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
Video: https://www.youtube.com/watch?v=N73u1mx9DqY
How to implement authorization in your backend with AWS IAMProvectus
AWS Dev Day Kyiv 2019
Track: Backend & Architecture
Session: ""How to implement authorization in your backend with AWS IAM""
Speaker: Stas Ivaschenko, AWS solutions architect at Provectus
Level: 400
Video: https://www.youtube.com/watch?v=4Jje_WJ4V7Q
AWS Dev Day is a free, full-day technical event where new developers will learn about some of the hottest topics in cloud computing, and experienced developers can dive deep on newer AWS services.
Provectus has organized AWS Dev Day Kyiv in close collaboration with Amazon Web Services: 800+ participants, 18 sessions, 3 tracks, a really AWSome Day!
Now, together with Zeo Alliance, we're building and nurturing AWS User Group Ukraine — join us on Facebook to stay updated about cloud technologies and AWS services: https://www.facebook.com/groups/AWSUserGroupUkraine
"
Show drafts
volume_up
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
5. Getting Big Data Projects to Production Is aGetting Big Data Projects to Production Is a
ChallengeChallenge
Only 15 percent of businesses reported deploying their
big data project to production, effectively unchanged
from last year (14 percent).
https://www.gartner.com/newsroom/id/3466117
6. Data engineers vs. data scientistsData engineers vs. data scientists
A common starting point is 2-3 data engineers for every
data scientist. For some organizations with more
complex data engineering requirements, this can be 4-5
data engineers per data scientist
https://www.oreilly.com/ideas/data-engineers-vs-
data-scientists
9. The lack of right tooling?The lack of right tooling?
10. Getting Spark Projects to ProductionGetting Spark Projects to Production
How to
provide a way to run our job?
return a result?
report an error?
run multiply jobs in parallel?
13. Web applicationsWeb applications
Java servlet - 1997
public class NewServlet extends HttpServlet {
@Override
protected void doGet(
HttpServletRequest request,
HttpServletResponse response) throws ServletException, IOException {
...
}
}
14. Big data processingBig data processing
Hadoop - 2005
Apache Spark - 2011
object MyApp {
def main(arg: String[]): Unit = {
val conf = new SparkConf().setAppName(...)
val sc = new SparkContext(conf)
...
}
}
./bin/spark-submit ...
19. What a hell is going on!
These thing aren't
about big data!
20. It isn't normal development experinceIt isn't normal development experince
A lot of actions over ssh-session to test, deploy and
run your code
Improssible to run job without direct shell
command
Process more that one job in parallel is difficult
Poor interfaces
21. Just a special service!Just a special service!
Artifacts/settings CRUD
API for launching jobs and receiving their status
Spark driver launcher
A library do describe operations using Spark
25. What is Mist?What is Mist?
It's a combinations of
Programming framework
HTTP/Async interface to run deploy and run spark
programms
Spark contexts / driver manager
26. ConceptsConcepts
Function
user code with Spark program to be deployed on Mist
Artifact
file (.jar or .py) that contains a Function
Context
settings for Mist Worker where Function is being
executed
Job
a result of the Function execution
27. Mist instancesMist instances
Mist Worker
Spark driver application which invokes functions
Mist Master
exposes http/async api
stores functions/artifacts/contexts
run/manage workers
run functions on workers
35. A lot of actions over ssh-session to test, deploy and
run your code
Improssible to run job without direct shell
command
Process more that one job in parallel is difficult
Poor interfaces
39. ContextContext
Mist Context describes parameters of Mist worker and
Spark context
model = Context
name = cluster_ctx
data {
precreated = false
spark-conf {
spark.master = yarn
spark.submit.deployMode = cluster
spark.executor.instances = 2
spark.executor.cores = 2
spark.executor.memory = 1G
}
}
40. Mist instancesMist instances
Mist Worker
receives req -> invokes functions -> returns result
back to master
Mist Master
queue requests
sends them on workers
47. Worker modesWorker modes
Exclusive
starts new worker for every request
Shared
reuses worker instance
model = Context
name = cluster_ctx
data {
precreated = false
max-parallel-jobs = 2
worker-mode = "shared"
// or
worker-mode = "exlsusive"
...
}
48. Multi clusterMulti cluster
Contexts may be configured to work on different
clusters
model = Context
name = cluster_ctx
data {
precreated = false
spark-conf {
spark.master = yarn
}
}
model = Context
name = cluster_ctx
data {
precreated = false
spark-conf {
spark.master = spark://
50. A lot of actions over ssh-session to test, deploy and
run your code
Improssible to run job without direct shell
command
Process more that one job in parallel is difficult
Poor interfaces
52. Pi EstimationPi Estimation
def main(arg: Array[String]): Unit = {
val count = sc.parallelize(1 to samples).filter { _ =>
val x = math.random
val y = math.random
x*x + y*y < 1
}.count()
println(s"Pi is roughly ${4.0 * count / samples}")
}
53. Pi EstimationPi Estimation
What signature is better?
def estimatePi1(samples: Int): Unit
// vs
def estimatePi2(samples: Int): Double
54. Word count againWord count again
def main(arg: Array[String]): Unit = {
val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
}
55. Word count againWord count again
What signature is better?
def wordCount(from: String, to: String): Unit
// vs
def wordCount(from: String, to: String): Map[String, Int]
// or
sealed trait OutputFile
case class HdfsFile(path: String) extends OutputFile
case class S3File(path: String) extends OutputFile
def wordCount(from: String, to: String): OutputFile
58. MistFnMistFn
Context is already described
Json input instead of arguments
Json output instead of Unit
Array[String] => Unit
// vs
(Json, SparkContext) => Json
62. A lot of actions over ssh-session to test, deploy and
run your code
Improssible to run job without direct shell
command
Process more that one job in parallel is difficult
Poor interfaces