The document discusses Random Forest and Decision Trees in Spark MLlib. It provides an overview of Spark and MLlib, describes Decision Trees and Random Forest algorithms in MLlib, and demonstrates them through Jupyter notebooks using golf and Titanic datasets. The speaker then discusses parameters, advantages, and limitations of Decision Trees and Random Forest models in MLlib.
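As a toy illustration of the split criterion behind these models (a stdlib-Python sketch over made-up golf-style data, not the MLlib API itself), a decision tree picks, at each node, the feature threshold that minimizes the weighted Gini impurity of the two children:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the class frequencies."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the numeric threshold minimizing the weighted child impurity."""
    best = (None, float("inf"))
    for t in sorted(set(values)):
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        if not left or not right:
            continue  # degenerate split, skip
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Toy "golf"-style data: humidity vs. play/don't play (invented numbers)
humidity = [65, 70, 80, 90, 95, 96]
play     = ["yes", "yes", "yes", "no", "no", "no"]
print(best_threshold(humidity, play))  # (80, 0.0): a perfect split
```

MLlib's trees apply the same idea per feature over binned candidate thresholds, and a Random Forest repeats it over bootstrapped samples with random feature subsets.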
Machine learning techniques are powerful, but building and deploying such models for production use require a lot of care and expertise.
A lot of books, articles, and best practices have been written on machine learning techniques and feature engineering, but putting those techniques to use in a production environment is usually forgotten and underestimated. The aim of this talk is to shed some light on current machine learning deployment practices and go into detail on how to deploy sustainable machine learning pipelines.
This session covers how to work with the PySpark interface to develop Spark applications, from loading and ingesting data to applying transformations. It covers how to work with different data sources, apply transformations, and follow Python best practices when developing Spark apps. The demo covers integrating Apache Spark apps, in-memory processing capabilities, working with notebooks, and integrating analytics tools into Spark applications.
Updates from Project Hydrogen: Unifying State-of-the-Art AI and Big Data in A... (Databricks)
Project Hydrogen is a major Apache Spark initiative to bring state-of-the-art AI and Big Data solutions together. It contains three major projects: 1) barrier execution mode, 2) optimized data exchange, and 3) accelerator-aware scheduling. A basic implementation of barrier execution mode was merged into Apache Spark 2.4.0, and the community is working on the latter two. In this talk, we will present progress updates to Project Hydrogen and discuss the next steps.
First, we will review the barrier execution mode implementation from Spark 2.4.0. It enables developers to embed distributed training jobs properly on a Spark cluster. We will demonstrate distributed AI integrations built on top of it, e.g., Horovod and Distributed TensorFlow. We will also discuss the technical challenges of implementing those integrations and future work. Second, we will outline ongoing work on optimized data exchange. Its target scenario is distributed model inference. We will present how we do performance testing/profiling, where the bottlenecks are, and how to improve the overall throughput on Spark. If time allows, we might also give updates on accelerator-aware scheduling.
Talk given at the London Semantic Web Meetup, February 17th 2015. Discusses projects that aim to bridge the gap between the RDF and Hadoop ecosystems, primarily focusing on Apache Jena Elephas.
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin... (Edureka!)
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training**
This Edureka PySpark tutorial will provide you with detailed and comprehensive knowledge of PySpark, how it works, and why Python works so well with Apache Spark. You will also learn about RDDs, DataFrames, and MLlib.
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap... (Chris Fregly)
* Title *
Spark After Dark 1.5: Deep Dive Into Latest Perf and Scale Improvements in Spark Ecosystem
* Abstract *
Combining the most popular and technically-deep material from his wildly popular Advanced Apache Spark Meetup, Chris Fregly will provide code-level deep dives into the latest performance and scalability advancements within the Apache Spark Ecosystem by exploring the following:
1) Building Scalable and Performant Spark SQL/DataFrames Data Source Connectors such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
2) Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
3) Tuning Spark Streaming Performance and Fault Tolerance with KafkaRDD and KinesisRDD
4) Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
5) Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
6) Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll
* Demos *
This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.
All demo code is available on GitHub at the following link: https://github.com/fluxcapacitor/pipeline/wiki
In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/
* Speaker Bio *
Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.
Previously, Chris was a Data Solutions Engineer at Databricks and a Streaming Data Engineer at Netflix.
When Chris isn’t contributing to Spark and other open source projects, he’s creating book chapters, slides, and demos to share knowledge with his peers at meetups and conferences throughout the world.
Sempala - Interactive SPARQL Query Processing on Hadoop (Alexander Schätzle)
Driven by initiatives like Schema.org, the amount of semantically annotated data is expected to grow steadily towards massive scale, requiring cluster-based solutions to query it. At the same time, Hadoop has become dominant in the area of Big Data processing with large infrastructures being already deployed and used in manifold application fields. For Hadoop-based applications, a common data pool (HDFS) provides many synergy benefits, making it very attractive to use these infrastructures for semantic data processing as well.
Indeed, existing SPARQL-on-Hadoop (MapReduce) approaches have already demonstrated very good scalability; however, query runtimes are rather slow due to the underlying batch processing framework. While this is acceptable for data-intensive queries, it is not satisfactory for the majority of SPARQL queries, which are typically much more selective and require only small subsets of the data.
In this paper, we present Sempala, a SPARQL-over-SQL-on-Hadoop approach designed with selective queries in mind. Our evaluation shows performance improvements by an order of magnitude compared to existing approaches, paving the way for interactive-time SPARQL query processing on Hadoop.
Quadrupling your elephants - RDF and the Hadoop ecosystem (Rob Vesse)
Presentation given at ApacheCon EU 2014 in Budapest on technologies aiming to bridge the gap between the RDF and the Hadoop ecosystems.
Talks primarily about RDF Tools for Hadoop (part of the Apache Jena project) and Intel Graph Builder (extensions to Pig)
ROCm and Distributed Deep Learning on Spark and TensorFlow (Databricks)
ROCm, the Radeon Open Ecosystem, is an open-source software foundation for GPU computing on Linux. ROCm supports TensorFlow and PyTorch using MIOpen, a library of highly optimized GPU routines for deep learning. In this talk, we describe how Apache Spark is a key enabling platform for distributed deep learning on ROCm, as it enables different deep learning frameworks to be embedded in Spark workflows in a secure end-to-end machine learning pipeline. We will analyse the different frameworks for integrating Spark with TensorFlow on ROCm, from Horovod to HopsML to Databricks' Project Hydrogen. We will also examine the surprising places where bottlenecks can surface when training models (everything from object stores to the Data Scientists themselves), and we will investigate ways to get around these bottlenecks. The talk will include a live demonstration of training and inference for a TensorFlow application embedded in a Spark pipeline written in a Jupyter notebook on Hopsworks with ROCm.
Benchmark Tests and How-Tos of Convolutional Neural Network on HorovodRunner ... (Databricks)
The freedom to iterate quickly on distributed deep learning tasks is crucial for smaller companies to gain competitive advantages and market share from the big tech giants. HorovodRunner brings this process to relatively accessible Spark clusters.
PySpark Programming | PySpark Concepts with Hands-On | PySpark Training | Edu... (Edureka!)
** PySpark Certification Training: https://www.edureka.co/pyspark-certification-training **
This Edureka tutorial on PySpark Programming will give you a complete insight of the various fundamental concepts of PySpark. Fundamental concepts include the following:
1. PySpark
2. RDDs
3. DataFrames
4. PySpark SQL
5. PySpark Streaming
6. Machine Learning (MLlib)
Semantic Integration with Apache Jena and Stanbol (All Things Open)
All Things Open 2014 - Day 1
Wednesday, October 22nd, 2014
Phillip Rhodes
Founder & President of Fogbeam Labs
Big Data
Semantic Integration with Apache Jena and Stanbol
Pandas UDF and Python Type Hint in Apache Spark 3.0 (Databricks)
Over the past several years, pandas UDFs have been perhaps the most important change to Apache Spark for Python data science. However, these functionalities evolved organically, leading to some inconsistencies and confusion among users. In Apache Spark 3.0, pandas UDFs were redesigned by leveraging Python type hints.
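A minimal sketch of the redesigned style (assuming only that pandas is installed; in Spark itself the function below would additionally be decorated with `@pandas_udf("long")` from `pyspark.sql.functions`): the Series-to-Series UDF variant, which previously had to be declared via a `PandasUDFType` argument, is now inferred from standard Python type hints.

```python
import pandas as pd

# Spark 3.0 reads the UDF variant (here: Series -> Series) from the
# type hints; with pyspark available this function would be wrapped
# with @pandas_udf("long") and used directly in DataFrame expressions.
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1

print(plus_one(pd.Series([1, 2, 3])).tolist())  # [2, 3, 4]
```

The same hint-driven scheme distinguishes the other variants (e.g. iterator-of-Series and grouped-aggregate UDFs) purely by their annotated signatures.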
A talk given at SemTechBiz 2014 in San Jose that follows up on the tool originally presented at the 2012 conference. Talks about the limitations we've encountered with the original tool and how we've evolved it to address these and build a more robust general purpose and open source SPARQL testing tool.
The tool is available on SourceForge in pre-built form at http://sourceforge.net/projects/sparql-query-bm/ or as code on SourceForge or GitHub (https://github.com/rvesse/sparql-query-bm)
A Java Implementer's Guide to Boosting Apache Spark Performance by Tim Ellison (J On The Beach)
Apache Spark has rocked the big data landscape, quickly becoming the largest open source big data community with over 750 contributors from more than 200 organizations. Spark's core tenets of speed, ease of use, and a unified programming model fit neatly with the high performance, scalable, and manageable characteristics of modern Java runtimes. In this talk we introduce the Spark programming model, and describe some unique Java runtime capabilities in the JIT, fast networking, serialization techniques, and GPU off-loading that deliver the ultimate big data platform for solving business problems. We will show how solutions, previously infeasible with regular Java programming, become possible with a high performance Spark core runtime, enabling you to solve problems smarter and faster.
Spark After Dark: Real-time Advanced Analytics and Machine Learning with Spark (Chris Fregly)
Generating high quality dating recommendations using advanced analytics, streaming data pipelines, machine learning, graph analytics, and text processing.
Use the latest Spark libraries including Spark SQL, Data Frames, BlinkDB, Spark Streaming, MLlib, and GraphX as well as Twitter's Algebird for sketch algorithms, probabilistic data structures, and approximations.
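To make the "probabilistic data structures" idea concrete (a stdlib-Python toy, not Algebird's Scala API), here is a minimal Bloom filter: it answers set-membership queries in constant space, with a small false-positive rate and no false negatives:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: approximate set membership in a fixed-size bitmask."""

    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, item):
        # Derive `hashes` bit positions deterministically from the item.
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("user:42")
print("user:42" in bf, "user:99" in bf)  # True, almost certainly False
```

Sketches like Algebird's HyperLogLog and Count-Min follow the same trade: bounded memory and mergeability in exchange for approximate answers, which is what keeps high-scale streaming ingestion stable.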
Machine learning with Apache Spark MLlib | Big Data Hadoop Spark Tutorial | C... (CloudxLab)
Big Data with Hadoop & Spark Training: http://bit.ly/2L227PI
This CloudxLab Spark MLlib tutorial helps you to understand Spark MLlib in detail. Below are the topics covered in this tutorial:
1) Introduction to Machine Learning
2) Applications of Machine Learning
3) Machine Learning - Types & Tools
4) Introduction to Spark MLlib
5) Movie Lens Recommendation - Collaborative Filtering in Spark MLlib
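As a toy illustration of the collaborative-filtering idea behind the MovieLens example (a stdlib-Python neighborhood sketch with invented ratings; MLlib itself uses ALS matrix factorization rather than this approach), unseen items are scored by the similarity-weighted ratings of other users:

```python
import math

# Tiny user -> {movie: rating} table (hypothetical data).
ratings = {
    "alice": {"Toy Story": 5, "Heat": 1, "Up": 4},
    "bob":   {"Toy Story": 4, "Heat": 2, "Up": 5},
    "carol": {"Toy Story": 1, "Heat": 5},
}

def cosine(u, v):
    """Cosine similarity over the movies two users have both rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[m] * v[m] for m in common)
    nu = math.sqrt(sum(u[m] ** 2 for m in common))
    nv = math.sqrt(sum(v[m] ** 2 for m in common))
    return dot / (nu * nv)

def recommend(user):
    """Rank movies the user hasn't seen by similarity-weighted ratings."""
    scores = {}
    for other, their in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], their)
        for movie, r in their.items():
            if movie not in ratings[user]:
                scores[movie] = scores.get(movie, 0.0) + sim * r
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("carol"))  # ['Up']
```

ALS in Spark MLlib reaches the same goal differently: it factorizes the sparse user-item matrix into low-rank user and item vectors, which scales to the full MovieLens dataset across a cluster.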
Scalable Monitoring Using Prometheus with Apache Spark Clusters with Diane F... (Databricks)
As Apache Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF: https://www.cncf.io/projects/), can be applied to monitor and archive system performance data in a containerized Spark environment.
In our examples, we will gather spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
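As a rough sketch of the wiring involved (a hedged illustration, not necessarily the speakers' exact setup: the sink and source class names come from Spark's built-in metrics system, while the exporter details below are placeholders), Spark's `conf/metrics.properties` can expose driver and executor metrics over JMX:

```
# conf/metrics.properties -- enable Spark's built-in JMX sink for all instances
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink

# expose JVM metrics (GC, heap) from the driver and executors
driver.source.jvm.class=org.apache.spark.metrics.source.JvmSource
executor.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```

Prometheus would then scrape the exposed MBeans, typically by attaching the JMX exporter Java agent via `spark.driver.extraJavaOptions` / `spark.executor.extraJavaOptions` (agent jar path, port, and mapping file are deployment-specific), with Grafana reading its dashboards from Prometheus as usual.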
Updated version of "Big Data Science in Scala" featuring Spark pipelines and hyper-parameter optimization techniques. This talk presents how three Scala libraries - Smile, Saddle, and Spark ML - satisfy the requirements of new Big Data Science projects, using click-through rate prediction as an example.
Joker'14 Java as a fundamental working tool of the Data Scientist (Alexey Zinoviev)
Alexey Zinoviev presented this paper at the Joker conference (http://jokerconf.com/#zinoviev).
This paper covers the following topics: Data Mining, Machine Learning, Mahout, Spark, MLlib, Python, Octave, and the R language
Deep Learning with DL4J on Apache Spark: Yeah it’s Cool, but are You Doing it... (Databricks)
DeepLearning4J (DL4J) is a powerful Open Source distributed framework that brings Deep Learning to the JVM (it can serve as a DIY tool for Java, Scala, Clojure and Kotlin programmers). It can be used on distributed GPUs and CPUs. It is integrated with Hadoop and Apache Spark. ND4J is an Open Source, distributed and GPU-enabled library that brings the intuitive scientific computing tools of the Python community to the JVM. Training neural network models using DL4J, ND4J and Spark is a powerful combination, but it presents some unexpected issues that can compromise performance and nullify the benefits of well written code and good model design. In this talk I will walk through some of those problems and present some best practices to prevent them, drawn from lessons learned when putting things into production.
SCALABLE MONITORING USING PROMETHEUS WITH APACHE SPARK (zmhassan)
As Spark applications move to a containerized environment, there are many questions about how to best configure server systems in the container world. In this talk we will demonstrate a set of tools to better monitor performance and identify optimal configuration settings. We will demonstrate how Prometheus, a project that is now part of the Cloud Native Computing Foundation (CNCF), can be applied to monitor and archive system performance data in a containerized Spark environment. In our examples, we will gather Spark metric output through Prometheus and present the data with Grafana dashboards. We will use our examples to demonstrate how performance can be enhanced through different tuned configuration settings. Our demo will show how to configure settings across the cluster as well as within each node.
Alpine is constantly innovating, ever since the founding of the company based on in-database analytics that went far beyond traditional, in-memory, code-based desktop applications. This initial innovation built on the work of the MADlib team at Greenplum/Pivotal, ultimately inspired by the work of Joe Hellerstein’s team at UC Berkeley. The team then made all of this functionality available in a simple web interface, which enabled enterprise collaboration and a team-based approach to analytics. Later on, Alpine released its first support for Hadoop, enabling complex analytics on Hadoop without any coding, taking care of all the complexity of MapReduce and Hadoop configuration. Most recently, Alpine has been building new capabilities on top of Spark, to offer Hadoop users a new level of performance and scale.
Using Apache Spark with IBM SPSS Modeler with Dr. Steve Poulin.
An introduction to Apache Spark and its relevant integration with IBM SPSS Modeler. Why integrate? What type of benefits?
A high-level review of the integration process, with advice on which enhanced features to pay attention to and common pitfalls to avoid.
BigDL webinar - Deep Learning Library for Spark (DESMOND YUEN)
BigDL is a distributed deep learning library for Apache Spark and a unified big data platform driving analytics and data science.
In this session, we discussed the end-to-end working of Apache Spark, mainly focusing on the "why, what, and how" factors. We discussed RDDs and the high-level APIs such as DataFrame and Dataset. The session also covered the internals of Spark.
Data Science and Deep Learning on Spark with 1/10th of the Code with Roope As... (Databricks)
Scalability and interactivity make Spark an excellent platform for data scientists who want to analyze very large datasets and build predictive models. However, the productivity of data scientists is hampered by lack of abstractions for building models for diverse types of data. For example, processing text or image data requires low level data coercion and transformation steps, which are not easy to compose into complex workflows for production applications. There is also a lack of domain specific libraries, for example for computer vision and image processing.
We present an open-source Spark library which simplifies common data science tasks such as feature construction and hyperparameter tuning, and allows data scientists to iterate and experiment on their models faster. The library integrates seamlessly with SparkML pipeline object model, and is installable through spark-packages.
The library brings deep learning and image processing to Spark through CNTK, OpenCV and TensorFlow in a frictionless manner, thus enabling scenarios such as training on GPU-enabled nodes, deep neural net featurization, and transfer learning on large image datasets. We discuss the design and architecture of the library, and show examples of building machine learning models for image classification.
Enhancing adoption of Open Source Libraries. A case study on Albumentations.AI (Vladimir Iglovikov, Ph.D.)
Presented by Vladimir Iglovikov:
- https://www.linkedin.com/in/iglovikov/
- https://x.com/viglovikov
- https://www.instagram.com/ternaus/
This presentation delves into the journey of Albumentations.ai, a highly successful open-source library for data augmentation.
Created out of a necessity for superior performance in Kaggle competitions, Albumentations has grown to become a widely used tool among data scientists and machine learning practitioners.
This case study covers various aspects, including:
People: The contributors and community that have supported Albumentations.
Metrics: The success indicators such as downloads, daily active users, GitHub stars, and financial contributions.
Challenges: The hurdles in monetizing open-source projects and measuring user engagement.
Development Practices: Best practices for creating, maintaining, and scaling open-source libraries, including code hygiene, CI/CD, and fast iteration.
Community Building: Strategies for making adoption easy, iterating quickly, and fostering a vibrant, engaged community.
Marketing: Both online and offline marketing tactics, focusing on real, impactful interactions and collaborations.
Mental Health: Maintaining balance and not feeling pressured by user demands.
Key insights include the importance of automation, making the adoption process seamless, and leveraging offline interactions for marketing. The presentation also emphasizes the need for continuous small improvements and building a friendly, inclusive community that contributes to the project's growth.
Vladimir Iglovikov brings his extensive experience as a Kaggle Grandmaster, ex-Staff ML Engineer at Lyft, sharing valuable lessons and practical advice for anyone looking to enhance the adoption of their open-source projects.
Explore more about Albumentations and join the community at:
GitHub: https://github.com/albumentations-team/albumentations
Website: https://albumentations.ai/
LinkedIn: https://www.linkedin.com/company/100504475
Twitter: https://x.com/albumentations
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
Unlocking Productivity: Leveraging the Potential of Copilot in Microsoft 365, a presentation by Christoforos Vlachos, Senior Solutions Manager – Modern Workplace, Uni Systems
UiPath Test Automation using UiPath Test Suite series, part 6DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
Pushing the limits of ePRTC: 100ns holdover for 100 daysAdtran
At WSTS 2024, Alon Stern explored the topic of parametric holdover and explained how recent research findings can be implemented in real-world PNT networks to achieve 100 nanoseconds of accuracy for up to 100 days.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!SOFTTECHHUB
As the digital landscape continually evolves, operating systems play a critical role in shaping user experiences and productivity. The launch of Nitrux Linux 3.5.0 marks a significant milestone, offering a robust alternative to traditional systems such as Windows 11. This article delves into the essence of Nitrux Linux 3.5.0, exploring its unique features, advantages, and how it stands as a compelling choice for both casual users and tech enthusiasts.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
3. Agenda
● Spark and MLlib Overview
● Decision Trees in Spark MLlib
● Random Forest in Spark MLlib
● Demo
4. What is MLlib?
● MLlib is Spark’s machine learning (ML) library.
● Its goal is to make practical machine learning scalable and easy.
● ML algorithms include classification, regression, decision trees and
random forests, recommendation, clustering, topic modeling (LDA),
distributed linear algebra (SVD, PCA), and many more.
https://spark.apache.org/docs/latest/ml-guide.html
6. Spark MLlib
● Initially developed by the MLbase team at AMPLab (UC Berkeley) - supported only Scala and Java
● First shipped with Spark v0.8 (Sep 2013)
● Current release: Spark 2.2.0 (Jul 11, 2017)
● MLlib 2.2:
○ DataFrame-based API becomes the primary API for MLlib
■ org.apache.spark.ml and pyspark.ml
○ Ease of use
■ Java, Scala, Python, R
■ Interoperates with NumPy in Python
■ Works with any Hadoop data source (e.g. HDFS, HBase, or local files), making it easy to plug into Hadoop workflows.
7. Spark MLlib in production
● ML persistence
○ Saving and loading models and pipelines
● Pipelines: MLlib standardizes APIs for machine learning algorithms to make it easier to combine multiple algorithms into a
single pipeline, or workflow. This section covers the key concepts introduced by the Pipelines API, where the pipeline concept is
mostly inspired by the scikit-learn project.
Some concepts[5]
● DataFrame
● Transformer - transforms one DataFrame into another
● Estimator - fits on a DataFrame to produce a Transformer
● Pipeline - chains multiple Transformers and Estimators to specify an ML workflow.
8. MLlib 2.2.0 API
1. MLlib: RDD-based API (maintenance mode)
2. MLlib: DataFrame-based API
Why is MLlib switching to the DataFrame-based API?
● DataFrames provide a more user-friendly API than RDDs.
a. The many benefits of DataFrames include Spark data sources,
b. SQL/DataFrame queries, Tungsten and Catalyst optimizations, and
c. uniform APIs across languages.
● The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
● DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.
What is “Spark ML”?
● “Spark ML” is not an official name but occasionally used to refer to the MLlib DataFrame-based API.
https://spark.apache.org/docs/2.2.0/ml-guide.html
9. Agenda
● Spark and MLlib Overview
● Decision Tree in Spark MLlib
● Random Forest in Spark MLlib
● Demo
14. Decision Tree
Hyper-parameters for a decision tree in MLlib:
● numClasses: How many classes are we trying to classify?
● categoricalFeaturesInfo: A specification declaring which features are categorical and should
not be treated as numbers.
● impurity: A measure of the homogeneity of the labels at a node. Currently Spark provides two measures
of impurity for classification: Gini and entropy.
● maxDepth: A stopping criterion that limits the depth of constructed trees. Generally, deeper trees give
more accurate results but run the risk of overfitting.
15. Decision Trees are prone to Overfitting
Reference: Machine Learning - Decision Trees and Random Forests - by Loonycorn
16. Spark MLlib - Decision Tree Example Notebook
1. Notebook using the small golf-play dataset:
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/SparkMllibPyspark.golf.ipynb
image:http://i2.cdn.turner.com/dr/pga/sites/default/files/articles/bro-hof-17th-bubba-072411-640x360.jpg?1311698162
17. Agenda
● Spark and MLlib Overview
● Decision Tree in Spark MLlib
● Random Forest in Spark MLlib
● Demo
19. Random Forest
The name “Random Forest” comes from combining the randomness used
to pick the subset of data with a collection of decision trees.
A random forest (RF) is a collection of tree predictors
f(x, T, Θk), k = 1, …, K
where the Θk are i.i.d. random vectors.
The trees are combined by
● voting (for classification)
● averaging (for regression).
21. Random Forest - Out-of-Bag Error
Out-of-bag error estimate
● Average over the cases within each class to get a class-wise out-of-bag error
rate.
● Average over all cases to get an overall out-of-bag error rate.
In random forests, there is no need for cross-validation
to get an unbiased estimate of the
test set error: it is estimated internally during training.
23. Random Forest - Parameters
RandomForest - some parameters (from the older RDD-based API, but most apply to the
DataFrame-based API as well)
● numTrees: Number of trees in the resulting forest. Increasing the number of trees
decreases model variance.
● featureSubsetStrategy: Specifies how many features are selected for training a single tree.
● seed: Seed for random-generator initialization, since RandomForest depends on
random selection of features and rows.
24. Random Forest - Parameters
Spark also provides additional parameters to stop tree growth and produce fine-grained trees:
● minInstancesPerNode: A node is not split further if the split would produce a left or right child
containing fewer observations than the value specified by this parameter. The default value is 1,
but typically for regression problems or large trees the value should be higher.
● minInfoGain: Minimum information gain a split must achieve. The default value is 0.0.
25. MLlib - Labeled Point Vector (RDD-based)
Labeled point vector
● Prior to running any supervised machine learning algorithm with the RDD-based Spark MLlib API, we must convert our dataset
into labeled point vectors.
○ // Scala: pair each response with its feature vector
  val higgs = response.zip(features).map {
    case (response, features) => LabeledPoint(response, features)
  }
  higgs.setName("higgs").cache()
● An example of a labeled point vector follows:
(1.0, [0.123, 0.456, 0.567, 0.678, ..., 0.789])
26. MLlib - Data Caching
Data caching
Many machine learning algorithms are iterative in nature and thus require multiple passes over the data.
Spark provides a way to persist data that we need to iterate over, and publishes several
StorageLevels for storing data with various options:
● NONE: No caching at all
● MEMORY_ONLY: Caches RDD data only in memory
● DISK_ONLY: Writes cached RDD data to disk and releases it from memory
27. Making sense of a tragedy - the Titanic dataset
image:http://www.oceangate.com/images/expeditions/titanic/titanic-sinking-wikimedia-commons.jpg
● Notebook using the Titanic dataset and the MLlib Spark DataFrame-based APIs for
Decision Tree and Random Forest:
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/SparkMlLibTitanicNewDFbasedAPI.ipynb
28. Notebooks
1. Notebook using the small golf-play dataset:
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/SparkMllibPyspark.golf.ipynb
2. Notebook using the Titanic dataset and the MLlib Spark DataFrame-based APIs for
Decision Tree and Random Forest:
http://nbviewer.jupyter.org/github/tuhinmahmud/sigkdd_austin/blob/master/SparkMlLibTitanicNewDFbasedAPI.ipynb