The document describes the Cloudera Data Science Challenge, which involves solving three data science problems using large datasets. For the first problem, Smartfly, the goal is to predict flight delays using historical flight data and machine learning algorithms like logistic regression and SVM. The second problem, Almost Famous, involves statistical analysis of web log data and filtering for spam. The third problem, Winklr, requires analyzing a social network graph to recommend users to follow. The document discusses the approaches, tools, and results for each problem.
Machine Learning has become a must to improve insight, quality and time to market. But it's also been called the 'high interest credit card of technical debt' with challenges in managing both how it's applied and how its results are consumed.
Heuristic design of experiments w meta gradient search Greg Makowski
Once you have started learning about predictive algorithms, and the basic knowledge discovery in databases process, what is the next level of detail to learn for a consulting project?
* Give examples of the many model training parameters
* Track results in a "model notebook"
* Use a model metric that combines both accuracy and generalization to rank models
* How to strategically search over the model training parameters - use a gradient descent approach
* One way to describe an arbitrarily complex predictive system is by using sensitivity analysis
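The search idea in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the talk's actual method: it swaps in a greedy coordinate-wise search for the gradient descent approach, and the names `combined_score`, `coordinate_search`, and `toy_evaluate`, the penalty weight, and the toy accuracy surface are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Tuple

Params = Dict[str, float]


def combined_score(train_acc: float, test_acc: float, penalty: float = 1.0) -> float:
    """Reward held-out accuracy and penalise the train/test gap (poor generalization)."""
    return test_acc - penalty * abs(train_acc - test_acc)


def coordinate_search(
    evaluate: Callable[[Params], Tuple[float, float]],
    start: Params,
    steps: Params,
    max_rounds: int = 20,
) -> Tuple[Params, List[dict]]:
    """Greedy search: nudge one parameter at a time, keep moves that improve the score."""
    notebook: List[dict] = []  # the "model notebook": one row per trained model

    def run(params: Params) -> float:
        train_acc, test_acc = evaluate(params)
        score = combined_score(train_acc, test_acc)
        notebook.append({**params, "train": train_acc, "test": test_acc, "score": score})
        return score

    best, best_score = dict(start), run(start)
    for _ in range(max_rounds):
        improved = False
        for name, step in steps.items():
            for delta in (+step, -step):
                trial = {**best, name: best[name] + delta}
                score = run(trial)
                if score > best_score:
                    best, best_score, improved = trial, score, True
        if not improved:
            break  # no single-parameter move helps any more
    return best, notebook


def toy_evaluate(params: Params) -> Tuple[float, float]:
    """Stand-in for real training (hypothetical): accuracy peaks near depth=6, rate=0.3."""
    test = 0.9 - 0.01 * (params["depth"] - 6) ** 2 - 0.5 * (params["rate"] - 0.3) ** 2
    train = test + 0.005 * params["depth"]  # deeper models overfit slightly more
    return train, test


best, notebook = coordinate_search(
    toy_evaluate, start={"depth": 2, "rate": 0.1}, steps={"depth": 1, "rate": 0.1}
)
print(best, len(notebook))
```

In a real project `toy_evaluate` would be replaced by an actual train-and-validate run, and the notebook rows would be persisted so that earlier experiments can be compared later.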
Moving Your Machine Learning Models to Production with TensorFlow Extended Jonathan Mugan
ML is great fun, but now we want it to solve real problems. To do this, we need a way of keeping track of all of our data and models, and we need to know when our models fail and why. This talk will cover how to move ML to production with TensorFlow Extended (TFX). TFX is used by Google internally for machine-learning model development and deployment, and it has recently been made public. TFX consists of multiple pipeline elements and associated components, and this talk will cover them all, but three elements are particularly interesting: TensorFlow Data Validation, TensorFlow Model Analysis, and the What-If Tool.
The TensorFlow Data Validation library analyses incoming data and computes distributions over the feature values. This can show us which features may not be useful, maybe because they always have the same value, or which features may contain bugs. TensorFlow Model Analysis allows us to understand how well our model performs on different slices of the data. For example, we may find that our predictive models are more accurate for events that happen on Tuesdays, and such knowledge can be used to help us better understand our data and our business. The What-If Tool is an interactive tool that allows you to change data and see what the model would say if a particular record had a particular feature value. It lets you probe your model, and it can automatically find the closest record with a different predicted label, which allows you to learn what the model is homing in on. Machine learning is growing up.
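The two ideas described here, per-feature value distributions and per-slice model quality, can be illustrated in plain Python. This sketch deliberately does not use the TFX libraries or mirror their APIs; the record layout and the function names `value_distributions`, `constant_features`, and `accuracy_by_slice` are assumptions for illustration only.

```python
from collections import Counter, defaultdict

# Hypothetical scored records: two features, a true label and a model prediction.
records = [
    {"day": "Tue", "country": "US", "label": 1, "pred": 1},
    {"day": "Tue", "country": "US", "label": 0, "pred": 0},
    {"day": "Wed", "country": "US", "label": 1, "pred": 0},
    {"day": "Wed", "country": "US", "label": 0, "pred": 0},
]


def value_distributions(rows, features):
    """Count how often each value occurs for each feature."""
    return {f: Counter(row[f] for row in rows) for f in features}


def constant_features(distributions):
    """Features with a single distinct value carry no signal for a model."""
    return [f for f, counts in distributions.items() if len(counts) == 1]


def accuracy_by_slice(rows, slice_key):
    """Accuracy of the stored predictions within each value of slice_key."""
    hits, totals = defaultdict(int), defaultdict(int)
    for row in rows:
        totals[row[slice_key]] += 1
        hits[row[slice_key]] += int(row["pred"] == row["label"])
    return {k: hits[k] / totals[k] for k in totals}


dists = value_distributions(records, ["day", "country"])
print(constant_features(dists))           # ['country']
print(accuracy_by_slice(records, "day"))  # {'Tue': 1.0, 'Wed': 0.5}
```

On this toy data the model looks perfect on Tuesdays and no better than chance on Wednesdays, which is the kind of slice-level difference the abstract mentions.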
Predictive Model and Record Description with Segmented Sensitivity Analysis (... Greg Makowski
Describing a predictive data mining model can provide a competitive advantage for solving business problems with a model. The SSA approach can also provide reasons for the forecast for each record. This can help drive investigations into fields and interactions during a data mining project, as well as identifying "data drift" between the original training data and the current scoring data. I am working on an open source version of SSA, first in R.
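As a rough illustration of per-record reasons, the sketch below applies plain one-at-a-time sensitivity analysis to a black-box scoring function. It is not the SSA method itself, whose segmentation details are not given here; `sensitivities`, `top_reasons`, `toy_model`, and the field names are hypothetical.

```python
def sensitivities(model, record, deltas):
    """Change in model output when each field is nudged by its delta, others held fixed."""
    base = model(record)
    result = {}
    for field, delta in deltas.items():
        nudged = {**record, field: record[field] + delta}
        result[field] = model(nudged) - base
    return result


def top_reasons(model, record, deltas, k=2):
    """Fields ranked by the absolute size of their effect on this record's forecast."""
    effects = sensitivities(model, record, deltas)
    return sorted(effects, key=lambda f: abs(effects[f]), reverse=True)[:k]


def toy_model(r):
    # Hypothetical scoring function with an interaction between income and debt.
    return 0.5 * r["income"] - 2.0 * r["debt"] + 0.1 * r["income"] * r["debt"]


record = {"income": 4.0, "debt": 1.0, "age": 30.0}
deltas = {"income": 1.0, "debt": 1.0, "age": 1.0}
effects = sensitivities(toy_model, record, deltas)
reasons = top_reasons(toy_model, record, deltas)
print(effects, reasons)
```

Because only the model's inputs and outputs are used, the same probing works for any predictive system, however complex its internals.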
Machine Learning: Understanding the Invisible Force Changing Our World Ken Tabor
Readers will gain an appreciation for machine learning, and take away valuable strategies including:
• What is machine learning.
• How it’s changing the world.
• Who the major players are.
• How you can control it.
Machine learning. It’s in the news. It’s discussed in corporate boardrooms. It’s on your mind. ML algorithms seem to be at once everywhere, yet nowhere. Can we possibly understand how this invisible force is shaping our world? How will it reform your industry, and change your job?
Data Science in the Real World: Making a Difference Srinath Perera
We use the terms “Big Data” and “Data Science” for the use of data processing to make sense of the world around us. Spanning many fields, Big Data brings together technologies like Distributed Systems, Machine Learning, Statistics, and the Internet of Things. It is a multi-billion-dollar industry including use cases like targeted advertising, fraud detection, product recommendations, and market surveys. With new technologies like the Internet of Things (IoT), these use cases are expanding to scenarios like Smart Cities, Smart Health, and Smart Agriculture.
These use cases draw on basic analytics, advanced statistical methods, and predictive technologies like Machine Learning. However, it is not just about crunching the data. Some use cases, like Urban Planning, move slowly, so there is enough time to process the data. With use cases like traffic, patient monitoring, and surveillance, however, the value of results degrades much faster with time, and results are needed within milliseconds to seconds. Collecting data from many sources, cleaning it up, processing it on computation clusters, and doing all of this fast is a major challenge.
This talk will discuss the motivation behind big data and data science and how they can make a difference. Then it will discuss the challenges, systems, and methodologies for implementing and sustaining a data science pipeline.
Machine learning is a type of artificial intelligence (AI) that provides computers with the ability to learn without being explicitly programmed. Machine learning focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data.
Data Science, Machine Learning and Neural Networks BICA Labs
Lecture briefly overviewing state of the art of Data Science, Machine Learning and Neural Networks. Covers main Artificial Intelligence technologies, Data Science algorithms, Neural network architectures and cloud computing facilities enabling the whole stack.
Explainable AI - making ML and DL models more interpretable Aditya Bhattacharya
Abstract –
Industries have started to adopt AI and Machine Learning in almost every sector to solve complex business problems, but are these models always trustworthy? Machine Learning models are not oracles; they are scientific methods and mathematical models that best describe the data. But science is all about explaining complex natural phenomena in the simplest way possible! So, can we make ML and DL models more interpretable, so that any business user can understand these models and trust their results?
To find out the answer, please join me in this session, in which I will talk about the concepts of Explainable AI and discuss its necessity and the principles that help us demystify black-box AI models. I will discuss popular approaches like Feature Importance, Key Influencers, and Decomposition Trees used to make classical Machine Learning interpretable. We will discuss various techniques used for Deep Learning model interpretation, like Saliency Maps, Grad-CAMs, and Visual Attention Maps, and finally go into more detail about frameworks like LIME, SHAP, ELI5, SKATER, and TCAV, which help us make Machine Learning and Deep Learning models more interpretable, trustworthy and useful!
To download please go to: http://www.intelligentmining.com/knowledge-base.html
Slides as presented by Alex Lin to the NYC Predictive Analytics Meetup group: http://www.meetup.com/NYC-Predictive-Analytics/ on April 1, 2010 (no joke!) :)
Using Graph Algorithms for Advanced Analytics - Part 5 Classification TigerGraph
What atmospheric data will help you predict if it's going to rain, snow, or be windy? What position should that new athlete play? How well can you guess a person's demographic background, based on their chat activity? These are all classification problems -- trying to pick the right category or label for an entity, based on observable features. They can also be solved with machine learning.
Interest in neural networks is growing, with many areas from image recognition to speech processing reporting impressive results. Neural networks have also found multiple applications in natural language processing. With advances in software and hardware technologies, and growing interest in AI-based applications, it is time to better understand neural networks applied to natural language processing!
In this workshop, we will discuss the basics of neural networks and natural language processing and discuss how neural approaches differ from traditional natural language modeling techniques with practical applications.
Fairly Measuring Fairness In Machine Learning HJ van Veen
We look at a case and two research papers on measuring discrimination in machine learning models for extending credit. Presentation given as part of the Sao Paulo Machine Learning Meetup, theme "Ethics in Data Science".
Introduction to machine learning. Basics of machine learning. Overview of machine learning. Linear regression. logistic regression. cost function. Gradient descent. sensitivity, specificity. model selection.
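Several of the topics listed here (logistic regression, cost function, gradient descent, sensitivity, specificity) fit into one small example. The sketch below is an assumption-laden illustration on a four-point toy data set, not material from the slides; the learning rate, step count, and function names are arbitrary choices.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def cost(w, b, xs, ys):
    """Mean cross-entropy cost of a one-feature logistic regression."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(sigmoid(w * x + b), 1e-12), 1 - 1e-12)  # clip to keep log finite
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(xs)


def fit(xs, ys, lr=0.5, steps=200):
    """Batch gradient descent on the cross-entropy cost."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / len(xs)
        b -= lr * sum(errors) / len(xs)
    return w, b


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)


xs, ys = [-2.0, -1.0, 1.0, 2.0], [0, 0, 1, 1]  # toy, linearly separable data
w, b = fit(xs, ys)
preds = [int(sigmoid(w * x + b) >= 0.5) for x in xs]
print(cost(0.0, 0.0, xs, ys), cost(w, b, xs, ys))
print(sensitivity_specificity(ys, preds))
```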
In this Lunch & Learn session, Chirag Jain gives us a friendly & gentle introduction to Machine Learning & walks through High-Level Learning frameworks using Linear Classifiers.
Cloudera Data Science Challenge 3 Solution by Doug Needham Doug Needham
This is my solution for the Cloudera Data Science Challenge 3. I use Spark MLLib for problem1, and Spark GraphX for problem3. Problem2 is "simple" streaming map-reduce.
PyData 2015 Keynote: "A Systems View of Machine Learning" Joshua Bloom
Despite the growing abundance of powerful tools, building and deploying machine-learning frameworks into production continues to be a major challenge, in both science and industry. I'll present some particular pain points and cautions for practitioners as well as recent work addressing some of the nagging issues. I advocate for a systems view, which, when expanded beyond the algorithms and codes to the organizational ecosystem, places some interesting constraints on the teams tasked with development and stewardship of ML products.
About: Dr. Joshua Bloom is an astronomy professor at the University of California, Berkeley where he teaches high-energy astrophysics and Python for data scientists. He has published over 250 refereed articles largely on time-domain transient events and telescope/insight automation. His book on gamma-ray bursts, a technical introduction for physical scientists, was published recently by Princeton University Press. He is also co-founder and CTO of wise.io, a startup based in Berkeley. Josh has been awarded the Pierce Prize from the American Astronomical Society; he is also a former Sloan Fellow, Junior Fellow at the Harvard Society, and Hertz Foundation Fellow. He holds a PhD from Caltech and degrees from Harvard and Cambridge University.
Lecture 5 from the COSC 426 Graduate course on Augmented Reality. This lecture talks about AR development tools and interaction styles. Taught by Mark Billinghurst from the HIT Lab NZ at the University of Canterbury. August 9th 2013
A Hands-on Intro to Data Science and R Presentation.ppt Sanket Shikhar
Using popular data science tools such as Python and R, the book offers many examples of real-life applications, with practice ranging from small to big data.
These are our contributions to Data Science projects, as developed in our startup. They are part of partner trainings and of the in-house design, development and testing of course material and concepts in Data Science and Engineering. The material covers data ingestion, data wrangling, feature engineering, data analysis, data storage, data extraction, querying data, and formatting and visualizing data for various dashboards. Data is prepared for accurate ML model predictions and Generative AI apps.
This is our project work at our startup for Data Science. This is part of our internal training and focused on data management for AI, ML and Generative AI apps
Certification Study Group - Professional ML Engineer Session 3 (Machine Learn... gdgsurrey
Dive into the essentials of ML model development, processes, and techniques to combat underfitting and overfitting, explore distributed training approaches, and understand model explainability. Enhance your skills with practical insights from a seasoned expert.
Microservices, containers, and machine learning Paco Nathan
http://www.oscon.com/open-source-2015/public/schedule/detail/41579
In this presentation, an open source developer community considers itself algorithmically. This shows how to surface data insights from the developer email forums for just about any Apache open source project. It leverages advanced techniques for natural language processing, machine learning, graph algorithms, time series analysis, etc. As an example, we use data from the Apache Spark email list archives to help understand its community better; however, the code can be applied to many other communities.
Exsto is an open source project that demonstrates Apache Spark workflow examples for SQL-based ETL (Spark SQL), machine learning (MLlib), and graph algorithms (GraphX). It surfaces insights about developer communities from their email forums. Natural language processing services in Python (based on NLTK, TextBlob, WordNet, etc.) get containerized and used to crawl and parse email archives. These produce JSON data sets, then we run machine learning on a Spark cluster to find insights such as:
* What are the trending topic summaries?
* Who are the leaders in the community for various topics?
* Who discusses most frequently with whom?
This talk shows how to use cloud-based notebooks for organizing and running the analytics and visualizations. It reviews the background for how and why the graph analytics and machine learning algorithms generalize patterns within the data, based on open source implementations for two advanced approaches, Word2Vec and TextRank. The talk also illustrates best practices for leveraging functional programming for big data.
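TextRank applies a PageRank-style ranking to a graph built from text, and the same iteration can rank people in a "who replies to whom" graph. The sketch below is a plain-Python power iteration written for illustration; it is not code from the Exsto project, and the reply edges and names are made up.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of directed (source, target) edges."""
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = out_links[src] or nodes  # dangling nodes spread rank evenly
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical reply graph: (author of reply, author being replied to).
replies = [("ann", "bob"), ("cat", "bob"), ("dan", "bob"), ("bob", "ann")]
ranks = pagerank(replies)
leader = max(ranks, key=ranks.get)
print(leader, ranks)
```

For TextRank proper, the nodes would be words or sentences and the edges would come from co-occurrence or similarity rather than replies.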
Spark Based Distributed Deep Learning Framework For Big Data Applications Humoyun Ahmedov
Deep Learning architectures, such as deep neural networks, are currently among the hottest emerging areas of data science, especially in Big Data. Deep Learning could be effectively exploited to address some major issues of Big Data, such as fast information retrieval, data classification, semantic indexing and so on. In this work, we designed and implemented a framework to train deep neural networks using Spark, a fast and general data flow engine for large-scale data processing, which can utilize cluster computing to train large-scale deep networks. Training Deep Learning models requires extensive data and computation. Our proposed framework can reduce training time by distributing the model replicas, trained via stochastic gradient descent, among cluster nodes for data residing on HDFS.
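The general pattern of training replicas on separate data partitions can be simulated in a few lines. The sketch below assumes synchronous gradient averaging on a one-parameter linear model; the abstract does not say which synchronization scheme the framework uses, and nothing here touches Spark or HDFS.

```python
def local_gradient(w, partition):
    """Mean squared-error gradient for y ~ w * x on one data partition."""
    return sum(2 * (w * x - y) * x for x, y in partition) / len(partition)


def distributed_sgd(partitions, w=0.0, lr=0.05, rounds=100):
    """Each round: every replica computes a gradient, the driver averages and updates."""
    for _ in range(rounds):
        grads = [local_gradient(w, p) for p in partitions]  # would run on worker nodes
        w -= lr * sum(grads) / len(grads)                   # driver-side averaging step
    return w


# Toy data generated from y = 2x, split into two partitions.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
partitions = [data[:2], data[2:]]
w = distributed_sgd(partitions)
print(w)
```

In a cluster setting the list comprehension over partitions is the part that would be distributed, with only gradients and the updated weight moving between workers and the driver.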
Lessons Learned from Building Machine Learning Software at Netflix Justin Basilico
Talk from Software Engineering for Machine Learning Workshop (SW4ML) at the Neural Information Processing Systems (NIPS) 2014 conference in Montreal, Canada on 2014-12-13.
Abstract:
Building a real system that incorporates machine learning as a part can be a difficult effort, both in terms of the algorithmic and engineering challenges involved. In this talk I will focus on the engineering side and discuss some of the practical issues we’ve encountered in developing real machine learning systems at Netflix and some of the lessons we’ve learned over time. I will describe our approach for building machine learning systems and how it comes from a desire to balance many different, and sometimes conflicting, requirements such as handling large volumes of data, choosing and adapting good algorithms, keeping recommendations fresh and accurate, remaining responsive to user actions, and also being flexible to accommodate research and experimentation. I will focus on what it takes to put machine learning into a real system that works in a feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. I will address the particular software engineering challenges that we’ve faced in running our algorithms at scale in the cloud. I will also mention some simple design patterns that we’ve found to be useful across a wide variety of machine-learned systems.
Monitoring AI applications with AI
The best-performing offline algorithm can lose in production. The most accurate model does not always improve business metrics. Environment misconfiguration or upstream data pipeline inconsistency can silently kill the model performance. Neither prodops, data science, nor engineering teams have the skills to detect, monitor and debug such types of incidents.
Could Microsoft have tested the Tay chatbot in advance and then monitored and adjusted it continuously in production to prevent its unexpected behaviour? Real mission-critical AI systems require an advanced monitoring and testing ecosystem that enables continuous and reliable delivery of machine learning models and data pipelines into production. Common production incidents include:
Data drifts, new data, wrong features
Vulnerability issues, malicious users
Concept drifts
Model Degradation
Biased Training set / training issue
Performance issue
In this demo-based talk we discuss a solution, tooling and architecture that allow machine learning engineers to be involved in the delivery phase and take ownership of the deployment and monitoring of machine learning pipelines.
It allows data scientists to safely deploy early results as end-to-end AI applications in a self-serve mode without assistance from engineering and operations teams. It shifts experimentation and even training phases from offline datasets to live production and closes a feedback loop between research and production.
The technical part of the talk will cover the following topics:
Automatic Data Profiling
Anomaly Detection
Clustering of inputs and outputs of the model
A/B Testing
Service Mesh, Envoy Proxy, traffic shadowing
Stateless and stateful models
Monitoring of regression, classification and prediction models
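One simple way to flag the data drift mentioned in the incident list is to compare a feature's live distribution with its training distribution. The sketch below uses the population stability index (PSI) as an example metric; this is a swapped-in technique chosen for illustration, not necessarily what the talk's tooling does, and the bin count and toy data are assumptions.

```python
import math


def psi(expected, actual, bins=4):
    """Population stability index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / width), 0), bins - 1)  # clamp out-of-range values
            counts[idx] += 1
        return [max(c / len(values), 1e-6) for c in counts]  # floor avoids log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = [i / 10 for i in range(100)]  # reference feature values seen in training
live_same = list(train)               # live traffic with the same distribution
live_shifted = [v + 5 for v in train] # live traffic after an upstream change

print(psi(train, live_same))
print(psi(train, live_shifted))
```

A PSI of zero means the binned distributions match; practitioners usually pick an alert threshold empirically for each feature.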
Data Summer Conf 2018, “Monitoring AI with AI (RUS)” - Stepan Pushkarev, CTO ... Provectus
In this demo-based talk we discuss a solution, tooling and architecture that allow machine learning engineers to be involved in the delivery phase and take ownership of the deployment and monitoring of machine learning pipelines. It allows data scientists to safely deploy early results as end-to-end AI applications in a self-serve mode without assistance from engineering and operations teams. It shifts experimentation and even training phases from offline datasets to live production and closes a feedback loop between research and production.
Data Workflows for Machine Learning - Seattle DAML Paco Nathan
First public meetup at Twitter Seattle, for Seattle DAML:
http://www.meetup.com/Seattle-DAML/events/159043422/
We compare/contrast several open source frameworks which have emerged for Machine Learning workflows, including KNIME, IPython Notebook and related Py libraries, Cascading, Cascalog, Scalding, Summingbird, Spark/MLbase, MBrace on .NET, etc. The analysis develops several points for "best of breed" and what features would be great to see across the board for many frameworks... leading up to a "scorecard" to help evaluate different alternatives. We also review the PMML standard for migrating predictive models, e.g., from SAS to Hadoop.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
Connector Corner: Automate dynamic content and events by pushing a button DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
State of ICS and IoT Cyber Threat Landscape Report 2024 previewPrayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio, cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors, and newer malware including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Key Trends Shaping the Future of Infrastructure.pdfCheryl Hung
Keynote at DIGIT West Expo, Glasgow on 29 May 2024.
Cheryl Hung, ochery.com
Sr Director, Infrastructure Ecosystem, Arm.
The key trends across hardware, cloud and open-source; exploring how these areas are likely to mature and develop over the short and long-term, and then considering how organisations can position themselves to adapt and thrive.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
3. Data Science, Why does it matter?
What is the only skill that matters for a data scientist?
“the ability to use the resources available to them to solve a challenge.”
Solving problems, the only skill you need to know
The skill of solving problems.
We both accomplished a lot in tackling this challenge. For some of the problems
we did well, for some we could improve.
This challenge shows the ability to solve problems over and above the actual
“answers” we sought.
I think too often we seek out people who have one particular skill or another, rather
than general problem solving abilities.
Certainly there is a time and a place for expertise with a particular set of skills. But
the skill of adaptability is often overlooked.
Think on this, the next time you are considering who you need to assist in solving a
problem.
4. Cloudera Certified Professional:
Data Scientist
Intent of CCP:DS
Demonstrate Knowledge in a Variety of Data
Science Topics
Demonstrate the Knowledge at Scale
Requirements
Pass Cloudera’s Data Science Essentials Exam (DS-200)
Pass Cloudera’s Data Science Challenge (semi-annual; use simulated data to solve real problems)
Change coming in Q2 2015
[Diagram labels: SME Expertise; Math & Statistics Knowledge; Computer Science Skills; Data Science; Machine Learning]
Reuters Article on CCP:DS certification
Topics
Data Acquisition
Data Evaluation
Data Transformation
Machine Learning
Clustering
Classification
Model Selection
Feature Selection
Probability
Visualization
Optimization
Collaborative Filtering
5. Fall 2014 Data Science Challenge
Timeline: October 21, 2014 to January 21, 2015
Each person sitting for the challenge has to submit individual
solutions for each problem.
Problems:
Problem 1: Smartfly – Predict probability of a flight being delayed.
Problem 2: Almost Famous – Statistical analysis of web log data.
Problem 3: Winklr – Who should follow whom.
6. Multiple Ways to Solve a Problem
100,000 ft overview of solutions and tools used, by problem and presenter

Smartfly (ML – binary classification)
Mark:
• Hive to explore the data
• Python & MapReduce to format and clean the input
• Spark MLlib for the model
Doug:
• Data Science at the Command Line: scripts, counts, summaries
• “Pseudo Map-Reduce”
• R plotting
• Spark MLlib for predictions

Almost Famous (spam filter & statistical analysis)
Mark:
• Python to explore the data
• Python to filter and answer questions
Doug:
• Data Science at the Command Line: scripts, counts, summaries
• “Pseudo Map-Reduce”
• SciPy for particular functions

Winklr (social network analysis)
Mark:
• Hive and command line to explore the data
• Mahout, Spark, command line and Python to develop a hybrid recommender
Doug:
• Gephi for analysis of subgraphs
• Python to format the data
• Spark GraphX for the solution
• Shell scripts to get the data into the required format
7. Smartfly – Problem Summary
Motivation
Client is an online travel service that provides timely travel information to their
customers
Their product team has come up with an idea of using flight data to predict whether a
flight will be delayed and use that information to respond proactively.
Given
7,374,365 records of historic flight data at 279 airports and 17 airlines
566,376 records of scheduled flight data
Requirements
Rank all scheduled flights in order of descending probability of delay
8. Smartfly – Raw Data (Starting Point)
Historic and scheduled data were provided in CSV format with the
following fields in each row:
1-Unique Flight ID (int)
2-Year (int)
3-Month (int)
4-Day of Month (int)
5-Day of Week (int)
6-Scheduled Departure (HHMM)
7-Scheduled Arrival (HHMM)
8-Airline (string)
9-Flight Number (int)
10-Tail Number (string)
11-Plane Model (string)
12-Seat Configuration (string)
13-Departure Delay in Minutes (int)
14-Origin Airport (string)
15-Destination Airport (string)
16-Distance Travelled in Miles (int)
17-Taxi In Time in Minutes (int)
18-Taxi Out Time in Minutes (int)
19-Cancelled (Boolean)
20-Cancellation Code (string)
10. Model Evaluation Criteria
Set the evaluation criteria prior to running
any models, similar to setting the null and
alternative hypotheses prior to conducting an
experiment
Selected criteria: Area Under Receiver
Operating Characteristic Curve (auROC)
Compare different models
Independent of cutoff
No cutoff assumptions required
11. Model Evaluation Criteria
Area Under Receiver Operating
Characteristic Curve (auROC)
Weighted Confusion Matrix
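As a rough illustration of why auROC needs no cutoff, the sketch below computes it from ranks alone, using the rank-sum (Mann-Whitney) identity. This is a plain-Python stand-in written for this transcript, not the Spark code used in the challenge; the function name `au_roc` is hypothetical.

```python
def au_roc(labels, scores):
    """Area under the ROC curve from ranks only (no cutoff is chosen).

    labels: 0/1 values (1 = delayed); scores: predicted scores, higher = more likely delayed.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Give tied scores the average of the 1-based ranks they span.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("auROC needs at least one positive and one negative label")
    pos_rank_sum = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ordering, which is why the same number can be used to compare different model types.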
12. Data Exploration
Used Hive primarily
SELECT Max(distance), Min(distance)
FROM sfhist
Determined range of values for
each field
Looked at delays by airline, airport,
plane model…
Are there mismatches in the data? (e.g.,
Cancelled = 0, but a valid Cancel
Code is present)
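The same two checks (value range and the Cancelled / Cancel Code mismatch) can be sketched in Python for readers without a Hive cluster. The column positions follow the field list on the raw data slide; the encoding of Cancelled as "0"/"1" and of missing values as "" or "NA" is an assumption, and `explore` is a hypothetical helper, not code from the challenge.

```python
# 0-based positions of fields 16, 19 and 20 from the raw data slide.
DISTANCE, CANCELLED, CANCEL_CODE = 15, 18, 19
MISSING = ("", "NA")  # assumed markers for missing values


def explore(rows):
    """Return (min distance, max distance, ids of rows with a cancel-code mismatch).

    rows: an iterable of 20-field records, e.g. what csv.reader yields.
    """
    distances = []
    mismatched_ids = []
    for row in rows:
        if row[DISTANCE] not in MISSING:
            distances.append(int(row[DISTANCE]))
        # Mismatch: flight marked as not cancelled, yet a cancel code is present.
        if row[CANCELLED] == "0" and row[CANCEL_CODE] not in MISSING:
            mismatched_ids.append(row[0])
    lowest = min(distances) if distances else None
    highest = max(distances) if distances else None
    return lowest, highest, mismatched_ids
```

Feeding it `csv.reader(f)` over the historic file would give the same kind of summary as the Hive query above, only on a single machine.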
13. Input Data Manipulation
Format the input for the ML algorithm (LIBSVM format) using Python and
MapReduce
Created dictionaries of airports, airlines, plane models, seat
configurations & holidays
LIBSVM – efficient sparse matrix format
0 10:1 13:1 46:1 51:1 52:1 67:1 77:1 82:1 106:1 674:1 804:1 3225:1
1 9:1 42:1 45:1 54:1 54:1 75:1 77:1 84:1 291:1 458:1 801:1 3891:1
Deal with errors & omissions in data
Validate
Manual calculation at the head/tail/changes
Verify the correct number of records
Response
0 = no delay
1 = delayed
Features
1-12: Month
13-43: Day
…
1K-7K: Tailnumber
7001+: Holidays
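A minimal sketch of the one-hot encoding into LIBSVM lines follows. Only the month (1-12) and day (13-43) blocks are taken from the slide; the weekday block at 44-50 is an assumption made for illustration, and the remaining blocks (airports, airlines, tail numbers, holidays) are left out because their exact offsets are not given. Function names are hypothetical.

```python
def build_offsets(blocks):
    """Map each block name to its first 1-based feature index.

    blocks: ordered list of (name, number_of_categories).
    """
    offsets, start = {}, 1
    for name, size in blocks:
        offsets[name] = start
        start += size
    return offsets


# Month and day sizes follow the slide; the weekday block is assumed.
OFFSETS = build_offsets([("month", 12), ("day", 31), ("weekday", 7)])


def to_libsvm(label, active_indices):
    """Format one record: label, then ascending index:1 pairs for the active features."""
    ordered = sorted(set(active_indices))
    return " ".join([str(label)] + ["%d:1" % i for i in ordered])


def encode_flight(label, month, day, weekday):
    """One-hot encode three categorical fields (all 1-based) into a LIBSVM line."""
    active = [
        OFFSETS["month"] + month - 1,
        OFFSETS["day"] + day - 1,
        OFFSETS["weekday"] + weekday - 1,
    ]
    return to_libsvm(label, active)
```

For example, `encode_flight(0, 10, 1, 3)` encodes an on-time flight on October 1st falling on weekday 3. Only non-zero features are written, which is what keeps the format compact for thousands of categorical columns.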
14. Train the Model
Split the historic data into training and testing subsets
Split randomly
Split based on time
Run the model in Spark
Load the formatted input
Set model parameters
Run the model (train the SVM or Logistic Regression Model)
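The two ways of splitting the historic data can be sketched as below. This is an illustrative plain-Python version, assuming records are held in a list; the 80/20 fraction, the seed and the function names are hypothetical choices, not values from the talk.

```python
import random


def random_split(records, test_fraction=0.2, seed=42):
    """Shuffle with a fixed seed, then hold out the last test_fraction of records."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


def time_split(records, key, cutoff):
    """Train on records before the cutoff, test on records at or after it.

    key: function returning a comparable timestamp, e.g. a (year, month) tuple.
    """
    train = [r for r in records if key(r) < cutoff]
    test = [r for r in records if key(r) >= cutoff]
    return train, test
```

A time-based split is closer to the real task, since the scheduled flights to be ranked all lie after the historic data, while a random split uses every season for training.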
15. Test the model
Use the model to predict delays in the test data and compare to determine the
auROC (Spark)
Repeat using a range of iterations, model types (SVM/Log), regularization
parameter (size of step), and regularization technique (L1/L2)
Results
Worst: auROC = 0.51 SVM using only flight times and default optimization settings
Best: auROC = 0.68
Logistic Regression
L2 Regularization (2000 iterations, step = 0.0001)
Categorical input for: month, day, weekday, time of day (6hr blocks), departing airport,
arrival airport, airline, seat configuration, flight number (type of flight), & holidays
Represents a 36% improvement over random selection
Predict the scheduled flights for submission
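To make the moving parts (iterations, step size, L2 regularization) concrete, here is a small batch-gradient-descent logistic regression over one-hot features, followed by the ranking step. It is a single-machine teaching sketch, not the Spark MLlib trainer used for the results above; all names and default values are hypothetical.

```python
import math


def sigmoid(z):
    z = max(-35.0, min(35.0, z))  # clamp to keep exp() well behaved
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(rows, labels, n_features, iterations=200, step=0.5, l2=0.01):
    """rows: lists of active 1-based feature indices; labels: 0/1.

    Returns weights w, where w[0] is the (unpenalized) intercept.
    """
    w = [0.0] * (n_features + 1)
    n = len(rows)
    for _ in range(iterations):
        grad = [0.0] * len(w)
        for active, y in zip(rows, labels):
            err = sigmoid(w[0] + sum(w[i] for i in active)) - y
            grad[0] += err
            for i in active:
                grad[i] += err
        for i in range(len(w)):
            penalty = l2 * w[i] if i > 0 else 0.0  # L2 shrinks weights toward zero
            w[i] -= step * (grad[i] / n + penalty)
    return w


def predict(w, active):
    """Predicted probability of delay for one record."""
    return sigmoid(w[0] + sum(w[i] for i in active))


def rank_by_delay(w, scheduled):
    """scheduled: dict of flight id -> active indices; most likely delayed first."""
    return sorted(scheduled, key=lambda fid: (-predict(w, scheduled[fid]), fid))
```

Sweeping `iterations`, `step` and `l2` over a grid and scoring each fitted model by auROC on the held-out data is the loop described on this slide.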
16. Smartfly Review
Ability to use all of the data
Unable to run an SVM / logistic regression model in R with ~6 million rows
Spark completed the final model in ~10 min
Can be used for any binary decision process
Issue loan or not
Purchase stock or not
Other ML algorithms, the basic process remains the same
Linear regression – predict a value
Clustering – segment your data for reporting
Collaborative filter – recommend products to customers…
17. Winklr – Problem Summary
Who should Follow whom?
Winklr is a curiously popular social network for fans of the sitcom Happy Days.
Users can post photos, write messages, and most importantly, follow each other’s
posts and content. This helps users keep up with new content from their favorite
users on the site.
Basically Winklr is a site that is set up similar to Twitter. We want to provide
recommendations on who to follow. We know that some people have “clicked”
on another user (I interpret this as a “Favorite”, or a “Re-Tweet”)
19. My solution
Type of problem: Graph Analysis
Create a Master Graph.
Run Page Rank to identify centrality.
Create many small graphs for individual users.
Mask the Master Graph, and PageRank Graph.
Multiply together the centrality, the in-degree of a possible candidate,
and the inverse of the path length from this particular
user to the candidate vertex to be followed.
This code runs in about 60 hours using Spark GraphX.
Code: Problem3.sh, and AnalyzeGraph.scala
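The scoring idea (centrality × in-degree × 1 / path length) can be sketched on a toy graph in plain Python. This is not the contents of Problem3.sh or AnalyzeGraph.scala, which are not reproduced in this transcript; the per-user "small graph" is approximated here by the set of vertices reachable from the user, and the damping factor, iteration count and function names are assumptions.

```python
from collections import deque


def pagerank(nodes, edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank; edges are (follower, followed) pairs."""
    n = len(nodes)
    out = {v: [] for v in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = out[v] if out[v] else nodes  # dangling vertex: spread evenly
            share = damping * rank[v] / len(targets)
            for t in targets:
                nxt[t] += share
        rank = nxt
    return rank


def hop_counts(user, edges):
    """Breadth-first path lengths from user along follow edges."""
    out = {}
    for src, dst in edges:
        out.setdefault(src, []).append(dst)
    dist = {user: 0}
    queue = deque([user])
    while queue:
        v = queue.popleft()
        for t in out.get(v, []):
            if t not in dist:
                dist[t] = dist[v] + 1
                queue.append(t)
    return dist


def recommend(user, nodes, edges, top_n=3):
    """Score reachable, not-yet-followed vertices by centrality * in-degree / hops."""
    rank = pagerank(nodes, edges)
    in_degree = {v: 0 for v in nodes}
    for _, dst in edges:
        in_degree[dst] += 1
    already = {dst for src, dst in edges if src == user}
    scores = {}
    for cand, hops in hop_counts(user, edges).items():
        if cand == user or cand in already:
            continue
        scores[cand] = rank[cand] * in_degree[cand] / hops
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]
```

Running a breadth-first search per user is the expensive part, which is consistent with the long runtime reported above for the full graph.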
20. Doug’s Problem Solving approach
This is the approach I took, and may or may not be useful for others to apply.
Analysis. I started with some basic numbers, and just browsing through the data with the “Data Science at the Command
line toolkit”. This is very handy for getting a feel for things.
Based on some general understanding this analysis provided, create a “pipeline”
Generally the data has to be transformed to a usable structure for the particular method of solving the problem.
Do some basics with the problem solving method, Stats, ML, Graph, etc…
Get some data back out of that tool, then format output to specification.
Iterate.
I did this for problem 1, moved to problem 2, then finally problem 3. Then went back to 1, back to 2, back to 3.
This method allowed me to give some “space” to myself, and actually look at each problem with fresh eyes on more
than one occasion.
Breaking the basics down of Input, Process, Output for each problem allowed me to have “working” code for each
problem really quickly, then through tuning, analysis, research, and some time to think about the problem, I was able to
come up with each unique solution.
It also allows me to refactor the code, having given each problem time to “rest”.
Very much like a painting, broad strokes first, details emerge as the painting progresses.
Another benefit is, if I am able to get the data all the way through the pipeline, it becomes obvious where the
performance bottlenecks are for the pipeline.
This method does take a bit of time.
21. Graph Analysis
As Graphs get really large it becomes difficult to visualize them.
However, I was able to “subset” the master graph based on the
recommendation output of my process.
I was expecting to see one big clump of nodes tightly connected.
This would be the “Target” to follow.
I was also expecting to see two smaller clumps of nodes, loosely
connected to the larger clump. These are the “followers”, as we
make a recommendation to them to follow the more popular node,
they will be closer connected to this user.
Here is the output from Gephi that shows whether the code worked
or not.
23. Looks good, except I was wrong.
The challenge is looking for those “Likely” to follow someone.
So this part called for something a little different than what I coded.
It appears they were looking for the neighbors of the people that
were already being followed.
This is a much less complicated problem than I actually solved.
I look forward to seeing what Data Science Challenge 4 will look
like.
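For comparison, the simpler reading ("neighbors of the people that were already being followed") might look like the sketch below. It is one interpretation of that sentence, not the official solution; the function name and the tie-breaking by user id are hypothetical.

```python
from collections import Counter


def neighbors_of_followed(user, follows):
    """Recommend the accounts followed by the people this user already follows.

    follows: dict mapping each user to the set of users they follow.
    Candidates are ordered by how many of the user's followees follow them.
    """
    already = follows.get(user, set())
    counts = Counter()
    for followed in already:
        for cand in follows.get(followed, set()):
            if cand != user and cand not in already:
                counts[cand] += 1
    return sorted(counts, key=lambda c: (-counts[c], c))
```

This needs only two hops of adjacency data per user, with no centrality computation or graph-wide path search.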
24. Where to go from here?
Spark.
Scala.
Learn these topics.
Teach these topics.
Especially for folks planning on sitting for Data Science challenge 4: Learn
Scala. Learn Spark.
Oh, and keep studying about Graphs…
For an example of what not to do: Doug's github link
Recent change – This is apparently the final Data Science challenge. Future
CCP:DS certs will be based on a testing format.
Editor's Notes
Doug and Mark intro
Doug and Mark intro
Doug
Mark
Mark
Doug and Mark
Mark
Switch to command line early – will start Model and then move back to this presentation. Wait on prompt from Mark to go to cmd line or discuss prior (dependent on how trial goes prior to presentation).
Animation Blacks out the fields that are NA (null) in the scheduled data set. Click/down arrow to start this animation when Mark starts to mention the items that are not provided/nulled out in the sample
Animation:
1st click when I start talking about Classification
2nd click when I mention SVM