Agile Data Science with Scala

How data science pipelines have to evolve, and how that evolution becomes accessible with the right technologies: Scala and the Spark Notebook.

Agile Data Science with Scala
by @DataFellas
Xavier Tordoir
xtordoir@data-fellas.guru
@xtordoir
Andy Petrella
noootsab@data-fellas.guru
@noootsab

Data Fellas
● Andy Petrella
○ Maths
○ Geospatial
○ Distributed Computing
○ Spark Notebook
○ Trainer Spark/Scala
○ Machine Learning
● Xavier Tordoir
○ Physics
○ Bioinformatics
○ Distributed Computing
○ Scala (& Perl)
○ Trainer Spark
○ Machine Learning

© Data Fellas SPRL 2016
Lineup
So if you’re not sure you want to stay...
● Pipeline: productizing Data Science
● Demo of Distributed Pipeline (Spark, Mesos, Akka, Cassandra, Kafka, Spark Notebook)
● Why Micro Services?
● Painful points:
○ Data science is Discontiguous
○ Context Lost in Translation
● Solution: Data Fellas’ Agile Data Science Toolkit

Pipeline
Productizing Data Science
● Modelling
○ Finding Data
○ Parsing structures
○ Cleaning
○ (Reducing)
○ Learning
○ Predicting
● Coding
○ Connect PROD data
○ Tuning training parameters
○ Create Prediction Service
○ Generate Deployable
● Deploying
○ Connect to PROD infrastructure
○ Integration with existing env
○ Allocate (schedule) resources
○ Ensure availability

Distributed Data Science
Demo
All-In Spark Notebooks
Get data: Source → Kafka
Prepare View: Kafka → Cassandra
Train Model: Cassandra → ML...
Create Server: Cassandra/ML/... → Akka Http
Create Client: JSON → HTML form, chart, table, ...
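
The demo steps above can be read as a chain of small functions. The following is only a minimal sketch in plain Scala, not the code shown in the talk: in-memory collections stand in for Kafka and Cassandra, a per-key mean stands in for the ML model, and a plain function stands in for the Akka HTTP route. All names (`Reading`, `getData`, `prepareView`, `trainModel`, `predictionService`) are hypothetical.

```scala
object DemoPipelineSketch {
  // Hypothetical record type; the slides do not describe the demo's data.
  final case class Reading(key: String, value: Double)

  // "Get data": an in-memory Seq stands in for a Kafka topic (assumption).
  def getData(): Seq[Reading] =
    Seq(Reading("a", 1.0), Reading("a", 3.0), Reading("b", 10.0))

  // "Prepare view": group values by key; a Map stands in for a Cassandra table.
  def prepareView(stream: Seq[Reading]): Map[String, Seq[Double]] =
    stream.groupBy(_.key).map { case (k, rs) => k -> rs.map(_.value) }

  // "Train model": the per-key mean is a placeholder for a real ML model.
  def trainModel(view: Map[String, Seq[Double]]): Map[String, Double] =
    view.map { case (k, vs) => k -> (vs.sum / vs.size) }

  // "Create server": a function from request key to a JSON string
  // stands in for an HTTP route.
  def predictionService(model: Map[String, Double]): String => String =
    key => model.get(key) match {
      case Some(p) => s"""{"key":"$key","prediction":$p}"""
      case None    => s"""{"key":"$key","prediction":null}"""
    }
}
```

Each function only depends on the output of the previous one, so any stage could be swapped for its distributed counterpart without touching the others. The client step (JSON to HTML widgets) is left out of this sketch.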

Bad Pipeline
Targeting Dashboard
Modelling → Coding → Deploying → Dashboard
Data Scientist focusing on the dashboard/report instead of content:
● breaks reusability of data
● time wasted on learning viz instead of increasing accuracy (or velocity)
● monolithic instead of service oriented

Extended Pipeline
Micro Services
Modelling → Coding → Deploying → Creating Services → Integrating Application
● Creating Services
○ Abstracts access to prepared views
○ Exposes Prediction capabilities
○ Highly horizontally scalable
○ Scaling micro services cluster → cheaper than computing cluster
● Integrating Application
○ Customer integration
○ Can be any technology
○ Can even be another pipeline!
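
To illustrate what "abstracts access to prepared views" and "exposes prediction capabilities" can mean in code, here is a minimal sketch under stated assumptions. The trait `PredictionApi`, the class `InMemoryApi` and the function `report` are hypothetical names, the in-memory `Map` stands in for whatever store holds the prepared view, and "predict the last observed value" is a placeholder model.

```scala
object ServiceLayerSketch {
  // What consumers see: no storage or model details leak through this interface.
  trait PredictionApi {
    def history(customer: String): Seq[Double]    // access to a prepared view
    def predict(customer: String): Option[Double] // prediction capability
  }

  // One possible backend (assumption): an in-memory Map plays the prepared view.
  final class InMemoryApi(view: Map[String, Seq[Double]]) extends PredictionApi {
    def history(customer: String): Seq[Double] =
      view.getOrElse(customer, Seq.empty)

    // Placeholder model: predict the last observed value, if any.
    def predict(customer: String): Option[Double] =
      history(customer).lastOption
  }

  // A consumer written only against the interface; it could be a web client
  // or, as the slide notes, another pipeline.
  def report(api: PredictionApi, customers: Seq[String]): Seq[String] =
    customers.map { c =>
      api.predict(c).fold(s"$c: no data")(p => s"$c: $p")
    }
}
```

Because `report` depends only on `PredictionApi`, the backend can be replaced or scaled out without changing any consumer, which is the property the slide attributes to the service layer.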

Painful points
Data science is Discontiguous
Collecting (Data Eng.) → Modelling (Scientist) → Coding (Data Eng.) → Deploying (Ops. Eng.) → Creating Services (Web Eng.) → Integrating Application (Customers)
➔ Highly heterogeneous environment
➔ Too many friction areas
➔ Time to market too long
Frictions
➔ No integration
➔ Error prone
➔ Schedule delays
Result: Lack of Agility

Painful points
Context Lost in Translation
Diagram: Data Lake, Input Data, Processing, Machine Learning, Model, Output Data, Application
● No contextual discovery
● No quality info
● No lineage (origin of the data)
● Link to process and input discarded
● Huge gap in architecture: binary and schema aware serving layer
● Accuracy depends on concealed quality of inputs
● No schema! hard and long integration, poor satisfaction
Moreover:
No backward links → no agility and no context awareness
Result: Lack of Reproducibility
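
One way to picture the missing "backward links" is to carry schema and lineage metadata next to the data, so that every derived dataset remembers its process and its inputs. The sketch below is only an illustration of that idea, not the toolkit's actual design; `Meta`, `Dataset`, `source`, `derive` and `lineage` are hypothetical names.

```scala
object LineageSketch {
  // Hypothetical metadata kept alongside the data itself.
  final case class Meta(
      name: String,
      schema: Seq[String],        // column names, so consumers are not left guessing
      producedBy: Option[String], // the process that created the dataset, if any
      inputs: Seq[Meta]           // backward links to the input datasets
  )

  final case class Dataset[A](rows: Seq[A], meta: Meta)

  // A source dataset has no producing process and no inputs.
  def source[A](name: String, schema: Seq[String], rows: Seq[A]): Dataset[A] =
    Dataset(rows, Meta(name, schema, None, Seq.empty))

  // Derive a new dataset while recording the process and the input it came from.
  def derive[A, B](in: Dataset[A], name: String, process: String, schema: Seq[String])(
      f: A => B): Dataset[B] =
    Dataset(in.rows.map(f), Meta(name, schema, Some(process), Seq(in.meta)))

  // Walk the backward links to list all upstream dataset names, oldest first.
  def lineage(meta: Meta): Seq[String] =
    meta.inputs.flatMap(in => lineage(in) :+ in.name)
}
```

With this shape, an application consuming an output dataset can still ask where the data came from and which process produced it, instead of receiving a context-free result.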

Our Approach
Agile Data Science Toolkit
Automatic Semantics Engine + Autogenerated Microservices = Integrated End-to-End Environment
Huge gain in Time and Reliability
● Notebook
● Computing Cluster
● Access Layer: exposes database, learning models, stream sources, notebooks, ...
● Knowledge Base: data type, process, lineage, usage
● Consumers, Customers
Easy to Release
Easy to (Re)Use
● Notebook
● Version Control (Git)
● Spark Job Project (SBT)
● Service Projects (SBT)
● Metadata (Doc, Logic, Schema, ...)
● Catalog (ElasticSearch)
● Deployable (Jar, Docker)
● Repository (Nexus, Docker Repo, Pypi, Gem Server)
● Client Projects (Node.Js, Java, Scala, Python, Ruby)
● Publishable (NPM, Jar, Pip/EasyInstall, Gem)
Roles: scientist, data engineer, ops engineer

Agile Data Science Toolkit
In a nutshell

O’Reilly
Online seminar

Growing
We’re Hiring! http://www.data-fellas.guru/#skillsjobs

Q/A
References
http://www.data-fellas.guru/
http://spark-notebook.io/
https://github.com/andypetrella/spark-notebook/
https://gitter.im/andypetrella/spark-notebook
Come see us at Strata
-- London at least
-- We have two talks :-)


Inflectra

In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring. Learn about: • The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks. • Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective. • Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification. • Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process. Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.

The Art of the Pitch: WordPress Relationships and Sales

Laura Byrne

Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes? All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.

FIDO Alliance Osaka Seminar: Overview.pdf

FIDO Alliance

Essentials of Automations: Optimizing FME Workflows with Parameters

Safe Software

Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place. Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects. Here’s what you’ll gain: - Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows. - Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy. - Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency. - Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity. We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic. Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.

State of ICS and IoT Cyber Threat Landscape Report 2024 preview

Prayukth K V

The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development. The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers: State of global ICS asset and network exposure Sectoral targets and attacks as well as the cost of ransom Global APT activity, AI usage, actor and tactic profiles, and implications Rise in volumes of AI-powered cyberattacks Major cyber events in 2024 Malware and malicious payload trends Cyberattack types and targets Vulnerability exploit attempts on CVEs Attacks on counties – USA Expansion of bot farms – how, where, and why In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East Why are attacks on smart factories rising? Cyber risk predictions Axis of attacks – Europe Systemic attacks in the Middle East Download the full report from here: https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/

Recently uploaded (20)

Neuro-symbolic is not enough, we need neuro-*semantic*

The Future of Platform Engineering

From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...

GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...

Generating a custom Ruby SDK for your web service or Rails API using Smithy

Accelerate your Kubernetes clusters with Varnish Caching

PCI PIN Basics Webinar from the Controlcase Team

Leading Change strategies and insights for effective change management pdf 1.pdf

Epistemic Interaction - tuning interfaces to provide information for AI support

FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...

GraphRAG is All You need? LLM & Knowledge Graph

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf

Encryption in Microsoft 365 - ExpertsLive Netherlands 2024

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality

The Art of the Pitch: WordPress Relationships and Sales

FIDO Alliance Osaka Seminar: Overview.pdf

Essentials of Automations: Optimizing FME Workflows with Parameters

State of ICS and IoT Cyber Threat Landscape Report 2024 preview

Agile data science with scala

1. Agile Data Science with Scala, by @DataFellas
Xavier Tordoir, xtordoir@data-fellas.guru, @xtordoir
Andy Petrella, noootsab@data-fellas.guru, @noootsab

2. Data Fellas
Andy Petrella: Maths, Geospatial, Distributed Computing, Spark Notebook, Trainer Spark/Scala, Machine Learning
Xavier Tordoir: Physics, Bioinformatics, Distributed Computing, Scala (& Perl), Trainer Spark, Machine Learning

3. © Data Fellas SPRL 2016
Lineup: So if you’re not sure you want to stay...
● Pipeline: productizing Data Science
● Demo of Distributed Pipeline (Spark, Mesos, Akka, Cassandra, Kafka, Spark Notebook)
● Why Micro Services?
● Painful points:
○ Data science is Discontiguous
○ Context Lost in Translation
● Solution: Data Fellas’ Agile Data Science Toolkit

4. Pipeline: Productizing Data Science
Modelling, Coding, Deploying
Finding Data
Parsing structures
Cleaning
(Reducing)
Learning
Predicting
Connect PROD data
Tuning training parameters
Create Prediction Service
Generate Deployable
Connect to PROD infrastructure
Integration with existing env
Allocate (schedule) resources
Ensure availability

5. Distributed Data Science: Demo
All-In Spark Notebooks
Get data: Source → Kafka
Prepare View: Kafka → Cassandra
Train Model: Cassandra → ML...
Create Server: Cassandra/ML/... → Akka Http
Create Client: Json → Html Form, Chart, table, ...

6. Bad Pipeline: Targeting Dashboard
Modelling, Coding, Deploying, Dashboard
Data Scientist focusing on the dashboard/report instead of content
breaks reusability of data
time wasted on learning viz instead of increasing accuracy (or velocity)
monolithic instead of service oriented

7. Extended Pipeline: Micro Services
Modelling, Coding, Deploying, Integrating, Application
Creating Services:
Abstracts access to prepared views
Exposes Prediction capabilities
Highly horizontally scalable
Scaling micro services cluster → cheaper than computing cluster
Customer integration:
Can be any technologies
Can even be another pipeline!

8. Painful points: Data science is Discontiguous
➔ Highly heterogeneous environment
➔ Too many friction areas
➔ Time to market too long
Diagram labels: Modelling, Coding, Deploying, Integrating, Application; Scientist, Data Eng., Ops. Eng., Web Eng., Customers; Creating Services; Collecting, Data Eng.
Frictions:
➔ No integration
➔ Error prone
➔ Schedule delays
Result: Lack of Agility

9. Painful points: Context Lost in Translation
Diagram labels: Data Lake, Processing, Machine Learning Model, Output Data, Input Data, Application
No contextual discovery
No quality info
No lineage (origin of the data)
Link to process and input discarded
Huge gap in architecture: binary and schema aware serving layer
Accuracy depends on concealed quality of inputs
No schema! hard and long integration, poor satisfaction
Moreover: No backward links → no agility and no context awareness
Result: Lack of Reproducibility

10. Data Fellas… Agile Data Science Toolkit

11. Our Approach: Agile Data Science Toolkit
Automatic Semantics Engine + Autogenerated Microservices
Integrated End-to-End Environment
Huge gain in Time and Reliability
Diagram labels: Notebook, Computing Cluster, Access Layer, Knowledge Base, Consumers, Customers
Exposes database, learning models, stream sources, notebooks, ...
data type process lineage usage
Easy to Release
Easy to (Re)Use
Notebook, Version Control (Git), Spark Job Project (SBT), Service Projects (SBT), Metadata (Doc, Logic, Schema, ...), Catalog (ElasticSearch), Deployable (Jar, Docker), Repository (Nexus, Docker Repo, PyPI, Gem Server), Client Projects (Node.js, Java, Scala, Python, Ruby), Publishable (NPM, Jar, Pip/EasyInstall, Gem)
scientist, data Engineer, ops Engineer

18. Data Fellas… Announcements!!!

21. Q/A
References:
http://www.data-fellas.guru/
http://spark-notebook.io/
https://github.com/andypetrella/spark-notebook/
https://gitter.im/andypetrella/spark-notebook
Come to Strata (London at least): we have two talks :-)
