This document provides an overview of machine learning techniques that can be applied in finance, including exploratory data analysis, clustering, classification, and regression methods. It discusses statistical learning approaches such as data mining and predictive modeling. For clustering, it describes techniques like k-means clustering, hierarchical clustering, Gaussian mixture models, and self-organizing maps. For classification, it mentions discriminant analysis, decision trees, neural networks, and support vector machines. It also provides summaries of regression, ensemble methods, and working with big data and distributed learning.
Video: http://videos.re-work.co/videos/464-agile-deep-learning
Deep Learning has been called the ‘new electricity’ — transforming every industry. Innovative architectures and applications receive deserved attention. But to turn innovation into value requires integrating deep learning into practical technology products. Such products, including Spotify's, are often developed following the principles of agile. This talk focuses on approaching deep learning in an agile way and on integrating deep learning into the agile cadence of a modern software development organization.
Deep Learning: Chapter 11 Practical Methodology - Jason Tsai
Lecture for Deep Learning 101 study group to be held on June 9th, 2017.
Reference book: https://www.deeplearningbook.org/
Past video archives: https://goo.gl/hxermB
Initiated by Taiwan AI Group (https://www.facebook.com/groups/Taiwan.AI.Group/)
An introduction to Machine Learning (and a little bit of Deep Learning) - Thomas da Silva Paula
A 25-minute talk about Machine Learning and a little bit of Deep Learning. It starts with some basic definitions (Supervised and Unsupervised Learning), then explains the basic functionality of neural networks, ending with Deep Learning and Convolutional Neural Networks.
Machine Learning Meetup that happened in Porto Alegre, Brazil.
For the full video of this presentation, please visit:
https://www.edge-ai-vision.com/2021/02/introducing-machine-learning-and-how-to-teach-machines-to-see-a-presentation-from-tryolabs/
Facundo Parodi, Research and Machine Learning Engineer at Tryolabs, presents the “Introduction to Machine Learning and How to Teach Machines to See” tutorial at the September 2020 Embedded Vision Summit.
What is machine learning? How can machines distinguish a cat from a dog in an image? What’s the magic behind convolutional neural networks? These are some of the questions Parodi answers in this introductory talk on machine learning in computer vision.
Parodi introduces machine learning and explores the different types of problems it can solve. He explains the main components of practical machine learning, from data gathering and training to deployment. He then focuses on deep learning as an important machine learning technique and provides an introduction to convolutional neural networks and how they can be used to solve image classification problems. Parodi also touches on recent advancements in deep learning and how they have revolutionized the entire field of computer vision.
Capitalico / Chart Pattern Matching in Financial Trading Using RNN - Alpaca
Capitalico is a web/mobile platform that utilizes deep learning to help financial traders build automated trading systems by understanding their trading charts. In this talk I show many of the techniques we developed to achieve the best performance and accuracy in deep learning for sequence pattern matching.
Recommendation system using collaborative deep learning - Ritesh Sawant
Collaborative filtering (CF) is a successful approach commonly used by many recommender systems. Conventional CF-based methods use the ratings given to items by users as the sole source of information for learning to make recommendations. However, the ratings are often very sparse in many applications, causing CF-based methods to degrade significantly in their recommendation performance. To address this sparsity problem, auxiliary information such as item content information may be utilized. Collaborative topic regression (CTR) is an appealing recent method taking this approach, which tightly couples the two components that learn from two different sources of information. Nevertheless, the latent representation learned by CTR may not be very effective when the auxiliary information is very sparse. To address this problem, we generalize recent advances in deep learning from i.i.d. input to non-i.i.d. (CF-based) input and propose in this paper a hierarchical Bayesian model called collaborative deep learning (CDL), which jointly performs deep representation learning for the content information and collaborative filtering for the ratings (feedback) matrix. Extensive experiments on three real-world datasets from different domains show that CDL can significantly advance the state of the art.
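To make the sparsity problem above concrete, here is a minimal sketch of the plain matrix-factorization baseline that CDL improves upon (this is NOT the CDL model itself; the toy ratings matrix and hyperparameters are illustrative assumptions):

```python
import numpy as np

# Toy user-item ratings matrix (0 = unobserved entry).
R = np.array([[5, 3, 0, 1],
              [4, 0, 0, 1],
              [1, 1, 0, 5],
              [0, 1, 5, 4]], dtype=float)
mask = R > 0  # which ratings were actually observed

rng = np.random.default_rng(0)
k = 2  # latent dimension
U = 0.1 * rng.standard_normal((R.shape[0], k))  # user factors
V = 0.1 * rng.standard_normal((R.shape[1], k))  # item factors

lr, reg = 0.01, 0.02
for _ in range(2000):  # SGD over the observed entries only
    for i, j in zip(*np.nonzero(mask)):
        err = R[i, j] - U[i] @ V[j]
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])

pred = U @ V.T  # fills in the unobserved cells
rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print(round(rmse, 3))
```

With very few observed ratings, the learned factors become unreliable; CDL's contribution is to also learn a deep representation of item content to compensate.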
In this Lunch & Learn session, Chirag Jain gives us a friendly & gentle introduction to Machine Learning & walks through High-Level Learning frameworks using Linear Classifiers.
Machine learning: the next revolution or just another hype? - Jorge Ferrer
These are the slides of my session at ModConf / Liferay DevCon 2016.
It attempts to make it easy for any developer to get started with Machine Learning. It presents three exercises which I'm giving as homework (yup, homework, you missed it, right? ;) to the audience.
The video for this session is now available at https://www.facebook.com/liferay/videos/vl.383534535315216/10154154247423108/?type=1 (starts at min 34)
Using Deep Learning to Find Similar Dresses - HJ van Veen
Report by Luís Mey ( https://www.linkedin.com/in/lu%C3%ADs-gustavo-bernardo-mey-97b38927/ ) on Udacity Machine Learning Course - Final Project: Use Deep Learning to Find Similar Dresses.
Shou-de Lin is currently a full professor in the CSIE department of National Taiwan University. He holds a BS in EE from National Taiwan University, an MS in EE from the University of Michigan, and an MS in Computational Linguistics and a PhD in Computer Science, both from the University of Southern California. He leads the Machine Discovery and Social Network Mining Lab at NTU. Before joining NTU, he was a post-doctoral research fellow at the Los Alamos National Lab. Prof. Lin's research spans machine learning and data mining, social network analysis, and natural language processing. His international recognition includes the best paper award at the IEEE Web Intelligence conference 2003, a Google Research Award in 2007, a Microsoft Research Award in 2008, a merit paper award at TAAI 2010, the best paper award at ASONAM 2011, and US Aerospace AFOSR/AOARD research awards for 5 years. He is an all-time winner of the ACM KDD Cup, having led or co-led the NTU team to five championships, and also led a team to win the WSDM Cup 2016 championship. He has served as senior PC for SIGKDD and area chair for ACL. He is currently an associate editor for the International Journal on Social Network Mining, the Journal of Information Science and Engineering, and the International Journal of Computational Linguistics and Chinese Language Processing. He received the Young Scholars' Creativity Award from the Foundation for the Advancement of Outstanding Scholarship and the Ta-You Wu Memorial Award.
Predictive Model and Record Description with Segmented Sensitivity Analysis (... - Greg Makowski
Describing a predictive data mining model can provide a competitive advantage for solving business problems with a model. The SSA approach can also provide reasons for the forecast for each record. This can help drive investigations into fields and interactions during a data mining project, as well as identify "data drift" between the original training data and the current scoring data. I am working on an open-source version of SSA, first in R.
Machine Learning for Dummies (without mathematics) - ActiveEon
It presents an introduction to the basic concepts of machine learning without mathematics. This is a short presentation for beginners in machine learning.
Deep Learning with Python: Getting started and getting from ideas to insights in minutes.
PyData Seattle 2015
Alex Korbonits (@korbonits)
This presentation was given July 25, 2015 at the PyData Seattle conference hosted by PyData and NumFocus.
Deep Learning: concepts and use cases (October 2018) - Julien SIMON
An introduction to Deep Learning theory
Neurons & Neural Networks
The Training Process
Backpropagation
Optimizers
Common network architectures and use cases
Convolutional Neural Networks
Recurrent Neural Networks
Long Short Term Memory Networks
Generative Adversarial Networks
Getting started
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Supercomputers and a team of MIT PhDs are no longer needed to create predictive models from data. We are witnessing innovations in machine learning that are making it an increasingly accessible field. This lecture aims to demystify machine learning through exposure to its concepts and to a number of technologies. In this talk, we will address the types of problems and the algorithms, always applied to real problems. Open source tools like Scikit-learn will also be presented, as well as a way to practice these ideas through competitions like Kaggle.
Big Data Analytics (ML, DL, AI) hands-on - Dony Riyanto
This is an additional slide deck for the Big Data Analytics introduction material (in the next file), which gets us started hands-on with several topics related to Machine/Deep Learning, Big Data (batch/streaming), and AI using TensorFlow.
Neural Networks and Deep Learning for Physicists - Héloïse Nonne
Introduction to neural networks and deep learning. Seminar given by Héloïse Nonne on February 19th, 2015 at CINaM (Centre Interdisciplinaire de Nanosciences de Marseille) at Aix-Marseille University
Learn the advantages and disadvantages of machine learning algorithms versus traditional statistical modelling approaches to solve complex business problems.
Big Data & Analytics for Banking, New York - Lars Hamberg
BIG DATA & ANALYTICS FOR BANKING SUMMIT, New York, 1 Dec 2015.
Keynote address: "How Predictive Analytics will change the Financial Services Sector”
Speaker : Lars Hamberg
http://www.specialistspeakers.com/?p=8367
Overview & Outlook: Why Big Data will over-deliver on its hype and transform Financial Services; use cases for Advanced Analytics and Big Data Analytics in the production and distribution of banking products; and new opportunities for incumbents in tomorrow's ecosystem.
Cognitive Computing and IBM Watson Solutions in the FinTech Industry - 2016 - Sasha Lazarevic
What is cognitive computing? How can IBM Watson contribute to the development of innovative FinTech solutions? A presentation by Sasha Lazarevic and Pierre Kaufmann from Geneva, Switzerland, to the FinTech community on the potential domains of application.
Presentation given on TechnicalAnalyst.com event "Machine learning techniques in finance" on 17th November 2016.
- What machine learning is and how it can help predict financial markets
- Technical stock analysis vs. behavioural news and social media analysis
- How machine learning can be applied to technical analysis in the stock market
- How machine learning can be applied to news/social media analysis
SAM: Sympathetic AI Messenger bot, hackathon 2016 - Leslie W
How do we build AI bots that are not just cold and calculating, but sympathetic, empathetic and compassionate in their care? We present SAM, a FB Messenger Bot trained using machine learning on compassionate texts from the Dalai Lama.
In this presentation, Andrew Covato talks about the uses of attribution modelling and Big Data within a large company like Google. Covato introduces himself and talks about his own working background before going on to discuss the ins and outs of marketing and digital marketing, as well as giving insight into attribution modelling from his experience.
Behavioral Analytics for Financial Intelligence - John Liu
Predictive analytics and machine learning have led to new methods for modeling human behavior and cognition. These methods are collectively known as behavioral analytics, focusing on how and why individuals and groups take actions and respond to them. This presentation discusses how financial services organizations are learning to leverage behavioral analytics for a broad set of applications that span customer insight, fraud detection, compliance, and market/investment intelligence.
Machine Learning for Big Data Analytics: Scaling In with Containers while Sc... - Ian Lumb
Watch On Demand Anytime via http://www.univa.com/resources/webinar-machine-learning.php
Armed with nothing more than an Apache Spark toting laptop, you have all the trappings required to prototype the application of Machine Learning against your data-science needs. From programmability in Scala, Java or Python, to built-in support for Machine Learning via MLlib, Spark is an exceedingly effective enabler that allows you to rapidly produce results.
Of course, as soon as your prototyping proves successful, you'll want to scale out to embrace the volume, variety and velocity that characterizes today's Big Data demands... in production. Because Spark is as comfortable on an isolated laptop as it is in a distributed-computing environment, addressing Big Data requirements in production boils down to effectively and efficiently embracing containers and clusters for Big Data Analytics.
And this is where offerings from Univa shine - i.e., in making the transition from prototype to production completely seamless. For some use cases, it makes sense to scale-in Spark based applications within Docker containers via Univa Grid Engine Container Edition or Navops by Univa; whereas in others, Spark is interfaced (as a Mesos-compliant framework) with Univa Universal Resource Broker, to permit scaling out on a cluster. In both scenarios, your production Spark applications are scheduled alongside other classes of workload - without a need for dedicated resources.
Agenda:
• Overview of Apache Spark as a platform for Deep Learning - from Python-based Jupyter Notebooks to Spark's Machine Learning library MLlib
• Overview of prototyping Machine Learning via Apache Spark on a laptop - without and within Docker containers
• Introductions to Univa Grid Engine Container Edition and Univa Universal Resource Broker plus Navops by Univa
• Overview of production Big Data Analytics platforms for Machine Learning
• Docker-containerized Apache Spark and Univa Grid Engine Container Edition
• Docker-containerized Apache Spark and Navops by Univa
• Apache Spark plus Univa Universal Resource Broker
• Introducing support for GPUs without and within Docker containers
• Use case example - using Machine Learning to classify data from Twitter without and within Docker containers
• Summary and next steps
Developing for Hybrid Cloud with Bluemix - Roberto Pozzi
How can you get all the benefits of developing your application in the cloud and guarantee a secure integration in a Hybrid Cloud scenario?
This deck, presented at IBM CloudKnow event in October 2014, explains how to do it with @IBMBluemix, the Platform as a Service solution from IBM.
The application is available on http://cloudknow-italy-web.mybluemix.net/home.html.
Anomaly detection using a deep one-class classifier - 홍배 김
- Introduces various methods for anomaly detection
- Shows how Support Vector Data Description (SVDD) can simplify the shape of a cluster to make cluster modeling easier, and how to handle ambiguous points near the cluster boundary
May 2015 talk to SW Data Meetup by Professor Hendrik Blockeel from KU Leuven & Leiden University.
With increasing amounts of ever more complex forms of digital data becoming available, the methods for analyzing these data have also become more diverse and sophisticated. With this comes an increased risk of incorrect use of these methods, and a greater burden on the user to be knowledgeable about their assumptions. In addition, the user needs to know about a wide variety of methods to be able to apply the most suitable one to a particular problem. This combination of broad and deep knowledge is not sustainable.
The idea behind declarative data analysis is that the burden of choosing the right statistical methodology for answering a research question should no longer lie with the user, but with the system. The user should be able to simply describe the problem, formulate a question, and let the system take it from there. To achieve this, we need to find answers to questions such as: what languages are suitable for formulating these questions, and what execution mechanisms can we develop for them? In this talk, I will discuss recent and ongoing research in this direction. The talk will touch upon query languages for data mining and for statistical inference, declarative modeling for data mining, meta-learning, and constraint-based data mining. What connects these research threads is that they all strive to put intelligence about data analysis into the system, instead of assuming it resides in the user.
Hendrik Blockeel is a professor of computer science at KU Leuven, Belgium, and part-time associate professor at Leiden University, The Netherlands. His research interests lie mostly in machine learning and data mining. He has made a variety of research contributions in these fields, including work on decision tree learning, inductive logic programming, predictive clustering, probabilistic-logical models, inductive databases, constraint-based data mining, and declarative data analysis. He is an action editor for Machine Learning and serves on the editorial board of several other journals. He has chaired or organized multiple conferences, workshops, and summer schools, including ILP, ECMLPKDD, IDA and ACAI, and he has been vice-chair, area chair, or senior PC member for ECAI, IJCAI, ICML, KDD, ICDM. He was a member of the board of the European Coordinating Committee for Artificial Intelligence from 2004 to 2010, and currently serves as publications chair for the ECMLPKDD steering committee.
Feature Engineering - Getting most out of data for predictive models - TDC 2017 - Gabriel Moreira
How should data be preprocessed for use in machine learning algorithms? How can the most predictive attributes of a dataset be identified? What features can be generated to improve the accuracy of a model?
Feature Engineering is the process of extracting and selecting, from raw data, features that can be used effectively in predictive models. As the quality of the features greatly influences the quality of the results, knowing the main techniques and pitfalls will help you to succeed in the use of machine learning in your projects.
In this talk, we will present methods and techniques that allow us to extract the maximum potential from the features of a dataset, increasing the flexibility, simplicity and accuracy of models: the analysis of the distribution of features and their correlations; the transformation of numeric attributes (scaling, normalization, log-based transformation, binning); categorical attributes (one-hot encoding, feature hashing); temporal attributes (date/time); and free-text attributes (text vectorization, topic modeling).
Python, Scikit-learn, and Spark SQL examples will be presented, along with how to use domain knowledge and intuition to select and generate features relevant to predictive models.
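A minimal sketch of two of the transformations named above (log-based scaling of a numeric attribute and one-hot encoding of a categorical one), assuming scikit-learn and pandas are available; the toy dataframe and column names are illustrative, not from the talk:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy dataset.
df = pd.DataFrame({
    "price": [10.0, 200.0, 55.0, 3.0],
    "city":  ["POA", "SP", "POA", "RJ"],
})

# Numeric attribute: log transform to compress a skewed range, then standardize.
log_price = np.log1p(df[["price"]])
scaled = StandardScaler().fit_transform(log_price)

# Categorical attribute: one-hot encoding (one indicator column per city).
onehot = OneHotEncoder().fit_transform(df[["city"]]).toarray()

features = np.hstack([scaled, onehot])
print(features.shape)  # 1 scaled numeric column + 3 city indicators
```

In practice these steps are usually wrapped in a `Pipeline`/`ColumnTransformer` so the same fitted transforms are reused at prediction time.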
A brief lesson on what constitutes computational decision making, from simple regression via various classification methods to deep learning. No maths, only basic concepts to teach the lingo of machine learning to a lay audience.
In machine learning, support vector machines (SVMs, also support-vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
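As a quick illustration of the definition above, here is a hedged sketch of an SVM classifier using scikit-learn's `SVC` on the standard iris dataset (the kernel and split are arbitrary choices for demonstration):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load a standard labeled dataset and hold out 30% for evaluation.
X, y = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit a linear-kernel SVM; C controls the margin/error trade-off.
clf = SVC(kernel="linear", C=1.0).fit(X_tr, y_tr)
acc = clf.score(X_te, y_te)
print(round(acc, 2))
```

Swapping `kernel="rbf"` turns the same estimator into a non-linear classifier via the kernel trick.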
The method of identifying similar groups of data in a data set is called clustering. Entities in each group are comparatively more similar to entities of that group than to entities of the other groups.
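A small sketch of that idea with k-means, assuming scikit-learn; the two synthetic blobs are an illustrative stand-in for real grouped data:

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups of 50 points each.
rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
b = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
X = np.vstack([a, b])

# k-means partitions the points so each is closest to its own centroid.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(sorted(np.bincount(km.labels_)))  # → [50, 50]
```

Because the blobs are far apart relative to their spread, the recovered clusters match the generating groups exactly; on real data the number of clusters usually has to be chosen with diagnostics such as the elbow method or silhouette scores.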
Here we detail how to build, through variance optimization, a range of portfolios over a multi-factor risk framework. This methodology is widely used by practitioners, as the factor covariance matrix is non-singular even with few observations, and is more stable.
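A minimal sketch of that approach, under my own illustrative assumptions (random loadings, two factors, unconstrained weights), rather than the author's exact method: build a factor-model covariance Sigma = B F Bᵀ + D and solve for the minimum-variance portfolio w = Σ⁻¹1 / (1ᵀΣ⁻¹1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 6, 2                              # 6 assets, 2 risk factors
B = rng.normal(size=(n, k))              # factor loadings (illustrative)
F = np.diag([0.04, 0.01])                # factor covariance (k x k)
D = np.diag(rng.uniform(0.01, 0.03, n))  # idiosyncratic variances

# Factor structure keeps Sigma full-rank even with few observations.
Sigma = B @ F @ B.T + D

# Unconstrained minimum-variance weights subject to sum(w) = 1.
ones = np.ones(n)
w = np.linalg.solve(Sigma, ones)
w /= w.sum()

print(np.round(w, 3), round(float(w @ Sigma @ w), 4))
```

The resulting portfolio variance is, by construction, no larger than that of holding any single asset; practical versions add long-only or exposure constraints via a quadratic-programming solver.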
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
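The automated validation idea above can be sketched as a small rule-based check; the rule names, columns, and thresholds here are hypothetical illustrations, not from any tool named in this document:

```python
import pandas as pd

def validate(df: pd.DataFrame) -> list[str]:
    """Return a list of data-quality violations found in df."""
    problems = []
    if df["id"].duplicated().any():
        problems.append("duplicate ids")
    if df["age"].lt(0).any() or df["age"].gt(120).any():
        problems.append("age out of range")
    null_rate = df["email"].isna().mean()
    if null_rate > 0.10:  # illustrative 10% threshold
        problems.append(f"email null rate {null_rate:.0%} exceeds 10%")
    return problems

# Example run on a deliberately dirty table: all three rules fire.
df = pd.DataFrame({"id": [1, 2, 2],
                   "age": [34, -1, 55],
                   "email": ["a@x.io", None, None]})
print(validate(df))
```

Running checks like these at ingestion time catches errors at the source, before they propagate downstream.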
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
As Europe's leading economic powerhouse and the fourth-largest hashtag#economy globally, Germany stands at the forefront of innovation and industrial might. Renowned for its precision engineering and high-tech sectors, Germany's economic structure is heavily supported by a robust service industry, accounting for approximately 68% of its GDP. This economic clout and strategic geopolitical stance position Germany as a focal point in the global cyber threat landscape.
In the face of escalating global tensions, particularly those emanating from geopolitical disputes with nations like hashtag#Russia and hashtag#China, hashtag#Germany has witnessed a significant uptick in targeted cyber operations. Our analysis indicates a marked increase in hashtag#cyberattack sophistication aimed at critical infrastructure and key industrial sectors. These attacks range from ransomware campaigns to hashtag#AdvancedPersistentThreats (hashtag#APTs), threatening national security and business integrity.
🔑 Key findings include:
🔍 Increased frequency and complexity of cyber threats.
🔍 Escalation of state-sponsored and criminally motivated cyber operations.
🔍 Active dark web exchanges of malicious tools and tactics.
Our comprehensive report delves into these challenges, using a blend of open-source and proprietary data collection techniques. By monitoring activity on critical networks and analyzing attack patterns, our team provides a detailed overview of the threats facing German entities.
This report aims to equip stakeholders across public and private sectors with the knowledge to enhance their defensive strategies, reduce exposure to cyber risks, and reinforce Germany's resilience against cyber threats.
7. SVM summary
SVMs avoid the plague of local minima (the optimization problem is convex).
The engineer's expertise lies in choosing an appropriate kernel (beware of overfitting; cross-validate and experiment with your own kernels).
SVMs only classify between 2 classes: use a one-vs-all or one-vs-one methodology for multiple classes.
A reference for use cases in computer vision and bioinformatics.
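The deck's later examples use MATLAB's `svmtrain`; as a rough Python/scikit-learn sketch of the same points (convexity, kernel choice, built-in multi-class handling — the dataset and parameters here are my own illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# SVC solves a convex problem (no local minima); the kernel is the key choice.
# scikit-learn handles the 3-class problem with a one-vs-one scheme internally.
for kernel in ("linear", "rbf"):
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1.0))
    scores = cross_val_score(model, X, y, cv=5)  # cross-validate the kernel choice
    print(kernel, round(scores.mean(), 3))
```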
9. Neural Network summary
Gradient descent algorithms: stochastic, mini-batch, conjugate.
Plague of local minima: difficult to calibrate.
The engineer's expertise lies in choosing an appropriate architecture (beware of overfitting; cross-validate and experiment with your own architectures — 'deeper learning').
10. Regression Trees
>> t = classregtree(X,Y);
>> Y_pred = t(X_new);
13. Why a regression, and what is a regression?
A regression is a model to explain and predict a process: supervised machine learning.
14. Why regularizing?
• Terms are correlated
• The regression matrix becomes close to singular
• A badly conditioned matrix yields poor numerical results
• Bayesian interpretation: the data-fit term is the (negative log-)likelihood, the regularisation term corresponds to a prior, and the minimizer is the posterior mode
We rather minimize ||y − Xβ||² + λ R(β) (likelihood term + regularisation term).
15. Why Lasso and Elastic Net?
• No method owns the truth
• Reduce the number of predictors in a regression model
• Identify important predictors
• Select among redundant predictors
• Produce shrinkage estimates with potentially lower predictive errors than ordinary least squares (cross-validation)
Lasso: minimize ||y − Xβ||² + λ ||β||₁
Elastic Net: minimize ||y − Xβ||² + λ₁ ||β||₁ + λ₂ ||β||₂²
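A small Python/scikit-learn sketch of the variable-selection behaviour described above (the synthetic data with 3 relevant predictors out of 30 is my own illustration):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV, LassoCV

rng = np.random.default_rng(1)
n, p = 200, 30
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [2.0, -1.5, 1.0]          # only the first 3 predictors matter
y = X @ true_coef + 0.1 * rng.normal(size=n)

# L1 penalty drives irrelevant coefficients to exactly zero;
# the penalty weight is chosen by cross-validation.
lasso = LassoCV(cv=5).fit(X, y)
enet = ElasticNetCV(cv=5, l1_ratio=0.5).fit(X, y)   # mixed L1/L2 penalty

print("lasso nonzero predictors:", np.flatnonzero(np.abs(lasso.coef_) > 0.1))
```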
16. Ensemble learning
Why ensemble learning? 'Melding results from many weak learners into one high-quality ensemble predictor.'
17. Main differences between Bagging and Boosting
Defining features:
• Bagging relies on randomness; boosting is adaptive and deterministic.
• Bagging trains each model on a bootstrapped sample; boosting uses the complete initial sample.
• In bagging, each model must perform well over the whole sample; in boosting, each model has to perform better than the previous one on the hard observations (outliers).
• In bagging, every model has the same weight; in boosting, models are weighted according to their performance.
Advantages and disadvantages:
• Bagging reduces model variance; with boosting, variance might rise.
• Neither yields a simple model anymore.
• Bagging can be parallelized; boosting cannot.
• Bagging overfits noise less, so it is better than boosting on noisy data.
• Bagging is usually more efficient than boosting, but in specific cases boosting may achieve far better accuracy.
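The contrast above can be sketched in Python/scikit-learn (the dataset and estimator counts are my own illustration; `BaggingClassifier` and `AdaBoostClassifier` stand in for the generic bagging/boosting procedures):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging: 50 independent trees on bootstrap samples (parallelisable,
# each tree equally weighted, reduces variance).
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Boosting: 50 sequential learners, each focusing on the previous one's
# errors and weighted by its performance.
boost = AdaBoostClassifier(n_estimators=50, random_state=0)

bag_acc = cross_val_score(bag, X, y, cv=5).mean()
boost_acc = cross_val_score(boost, X, y, cv=5).mean()
print("bagging :", round(bag_acc, 3))
print("boosting:", round(boost_acc, 3))
```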
24. Exploratory Data Analysis
Why exploratory analysis? Can be used to:
o Get a graphical view
o 'Pre-filter': identify preliminary data trends and behaviour
• Means:
• Multivariate plots
• Feature transformation: principal component analysis, factor model
• Feature selection: stepwise optimization
28. Factor model
Alternative to PCA to improve your components
>> [Lambda,Psi,T,stats,F] = factoran(stocks,3,'rotate','promax');
[3-D scatter plot of the promax-rotated factor loadings of DAX stocks (Deutsche Bank, Daimler, Allianz, MAN, ThyssenKrupp, BMW, Lufthansa, Siemens, SAP, VW, Deutsche Telekom, ...) on Components 1, 2 and 3]
29. Paring predictors: stepwise optimization
Some predictors might be correlated, others irrelevant.
Requires Statistics Toolbox™
>> [coeff,inOut] = stepwisefit(stocks, index);
[Plots over 2007–2011 of returns (original data vs. stepwise fit) and prices]
30. Cloud of randomly generated points
• Each cluster center is randomly chosen inside specified bounds
• Each cluster contains the specified number of points per cluster
• Each cluster point is sampled from a Gaussian distribution
• Multidimensional dataset
>> clusters = 8; % number of clusters
>> points = 30; % number of points in each cluster
>> std_dev = 0.05; % common cluster standard deviation
>> bounds = [0 1]; % bounds for the cluster centers
>> [x,vcentroid,proportions,groups] = cluster_generation(bounds,clusters,points,std_dev);
[Scatter plot of the generated points, coloured by group (Group1–Group8)]
31. Clustering
Why clustering?
o Segment populations into natural subgroups
o Identify outliers
o As a preprocessing method – build separate models on each subgroup
• Means:
• Hierarchical clustering
• Clustering with neural networks (self-organising map, competitive layer)
• K-means clustering
• Fuzzy c-means clustering
• Clustering using Gaussian mixture models
• Predictors: categorical, ordinal, discontinuous
[Scatter plot of the input vectors x(1), x(2)]
33. Hierarchical Cluster Analysis – how do I do it?
• Calculate pairwise distances between points
>> distances = pdist(x)
• Carry out hierarchical cluster analysis
>> tree = linkage(distances)
• Visualise as a dendrogram
>> dendrogram(tree)
• Assign points to clusters
>> assignments = cluster(tree,'cutoff',0.1)
34. Assessing the quality of a hierarchical cluster analysis
• The cophenetic correlation coefficient measures how closely the lengths of the tree links match the original distances between points
• How 'faithful' the tree is to the original data
• 0 is poor, 1 is good
>> cophenet(tree,distances)
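The same pdist → linkage → cluster → cophenet pipeline exists almost verbatim in Python's SciPy; a sketch on assumed synthetic data (two well-separated blobs):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two well-separated 2-D blobs of 30 points each.
x = np.vstack([rng.normal(0.0, 0.05, (30, 2)),
               rng.normal(1.0, 0.05, (30, 2))])

distances = pdist(x)                          # pairwise distances, as in MATLAB
tree = linkage(distances, method="average")   # hierarchical cluster analysis
assignments = fcluster(tree, t=2, criterion="maxclust")  # cut into 2 clusters

# Cophenetic correlation: close to 1 means the tree is faithful to the data.
c, _ = cophenet(tree, distances)
print("clusters:", np.unique(assignments), "cophenetic corr:", round(c, 3))
```

`scipy.cluster.hierarchy.dendrogram(tree)` would draw the dendrogram, as in the slide.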
35. K-Means Cluster Analysis – what is it doing?
1. Randomly pick K cluster centroids
2. Assign points to the closest centroid
3. Recalculate positions of cluster centroids
4. Reassign points to the closest centroid
5. Repeat steps 3–4 until centroid positions converge
36. K-Means Cluster Analysis – how do I do it?
Running the K-means algorithm for fixed K:
>> [memberships,centroids] = kmeans(x,K);
[Scatter plot of the clustered points]
37. Evaluating a K-Means analysis and choosing K
• Try a range of different K's, and compare the total point-centroid distances for each
>> for K=3:15
     [clusters,centroids,distances] = kmeans(data,K);
     totaldist(K-2) = sum(distances);
   end
>> plot(3:15,totaldist);
• Create silhouette plots
>> silhouette(data,clusters)
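A Python/scikit-learn sketch of the same model-selection loop, using the within-cluster distance (inertia) and the silhouette score (the three synthetic clusters are my own illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Three tight synthetic clusters.
x = np.vstack([rng.normal(c, 0.05, (30, 2)) for c in (0.0, 0.5, 1.0)])

inertias, sils = [], []
for K in range(2, 8):
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(x)
    inertias.append(km.inertia_)                   # total point-centroid distance
    sils.append(silhouette_score(x, km.labels_))   # higher is better

best_K = 2 + int(np.argmax(sils))
print("best K by silhouette:", best_K)
```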
38. Sidebar: Distance Metrics
• Measures of how similar datapoints are – different definitions make sense for different data
• Many built-in distance metrics, or define your own
>> doc pdist
>> distances = pdist(data,metric); % pdist = pairwise distances
>> squareform(distances)
>> kmeans(data,k,'distance','cityblock') % not all metrics supported
• Euclidean distance: the default
• Cityblock distance: useful for discrete variables
• Cosine distance: useful for clustering variables
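The three metrics can be compared on a tiny example with SciPy's `pdist`, which mirrors the MATLAB function (the two sample points are my own illustration):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

data = np.array([[1.0, 1.0],
                 [4.0, 5.0]])   # differs by (3, 4)

d_euc = pdist(data, metric="euclidean")   # straight-line: sqrt(3^2 + 4^2) = 5
d_city = pdist(data, metric="cityblock")  # |3| + |4| = 7
d_cos = pdist(data, metric="cosine")      # 1 - cos(angle between the rows)

# squareform turns the condensed vector into a full symmetric distance matrix.
D = squareform(d_euc)
print(d_euc, d_city, np.round(d_cos, 4))
```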
39. Fuzzy c-means Cluster Analysis – what is it doing?
• Very similar to K-means
• Samples are not assigned definitively to a cluster, but have a 'membership' value relative to each cluster
Requires Fuzzy Logic Toolbox™
Running the fuzzy c-means algorithm for fixed K:
>> [centroids, memberships] = fcm(x,K);
40. Gaussian Mixture Models
• Assume that data is drawn from a fixed number K of normal distributions
• Fit these parameters using the EM algorithm
>> gmobj = gmdistribution.fit(x,8);
>> assignments = cluster(gmobj,x);
Plot the probability density
>> ezsurf(@(x,y)pdf(gmobj,[x y]));
[Surface plot of the fitted mixture density]
41. Evaluating a Gaussian Mixture Model clustering
• Plot the probability density function of the model
>> ezsurf(@(x,y)pdf(gmobj,[x y]));
• Plot the posterior probabilities of observations
>> p = posterior(gmobj,data);
>> scatter(data(:,1),data(:,2),5,p(:,g)); % Do this for each group g
• Plot the Mahalanobis distances of observations to components
>> m = mahal(gmobj,data);
>> scatter(data(:,1),data(:,2),5,m(:,g)); % Do this for each group g
42. Choosing the right number of components in a Gaussian Mixture Model
• Evaluate for a range of K and plot AIC and/or BIC
• AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) are measures of the quality of the model fit, with a penalty for higher K
>> for K=3:15
     gmobj = gmdistribution.fit(data,K);
     AIC(K-2) = gmobj.AIC;
   end
>> plot(3:15,AIC);
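The same AIC/BIC sweep in Python/scikit-learn (the three synthetic clusters are my own illustration; `GaussianMixture` exposes `aic`/`bic` methods much like the MATLAB object's properties):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
x = np.vstack([rng.normal(c, 0.05, (40, 2)) for c in (0.0, 0.5, 1.0)])

bics = []
for K in range(1, 7):
    gm = GaussianMixture(n_components=K, random_state=0).fit(x)
    bics.append(gm.bic(x))   # lower BIC = better fit, penalised for extra components

best_K = 1 + int(np.argmin(bics))
print("best K by BIC:", best_K)
```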
43. Neural Networks – what are they?
[Diagram of a two-layer feedforward network: input variables combined through weights and a bias, passed through a transfer function to produce the output variable]
Build your own architecture.
44. Self Organising Map Neural Nets – what are they?
• Start with a regular grid of 'neurons' laid over the dataset
• The size of the grid gives the number of clusters
• Neurons compete to recognise datapoints (by being close to them)
• Winning neurons are moved closer to the datapoints
• Repeat until convergence
[Two 'SOM Weight Positions' plots (Weight 1 vs. Weight 2), before and after training]
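The steps above can be sketched from scratch in a few lines of NumPy. This is an assumed minimal implementation for illustration, not MATLAB's SOM toolbox: a 3×3 grid of neurons, a decaying learning rate, and a Gaussian neighbourhood around the winning neuron.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two clusters of 2-D data points.
data = np.vstack([rng.normal(0.2, 0.05, (50, 2)),
                  rng.normal(0.8, 0.05, (50, 2))])

# 3x3 grid of neurons: grid coordinates and randomly initialised weights.
grid = np.array([(i, j) for i in range(3) for j in range(3)], dtype=float)
weights = rng.uniform(0, 1, (9, 2))

for epoch in range(50):
    lr = 0.5 * (1 - epoch / 50)              # decaying learning rate
    radius = 2.0 * (1 - epoch / 50) + 0.5    # decaying neighbourhood radius
    for x in rng.permutation(data):
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # winning neuron
        grid_dist = ((grid - grid[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_dist / (2 * radius ** 2))         # neighbourhood weights
        weights += lr * h[:, None] * (x - weights)         # pull neurons toward x
```

After training, each neuron's weight vector sits near a region of the data, so the grid acts as a clustering of the dataset.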
45. Summary: Cluster analysis
No method owns the truth.
Use the diagnostic tools to assess your clusters.
Beware of local minima: consider global optimization.
46. Classification
Why classification? Can be used to:
o Learn how to classify from already classified observations
o Classify new observations
• Means:
• Discriminant analysis classification
• Bootstrap-aggregated decision tree classifier
• Neural network classifier
• Support vector machine classifier
[Scatter plot of the observations, coloured by group (Group1–Group8)]
47. Discriminant Analysis – how does it work?
• Fit a multivariate normal density to each class
• linear — fits a multivariate normal density to each group, with a pooled estimate of covariance. This is the default.
• diaglinear — similar to linear, but with a diagonal covariance matrix estimate (naive Bayes classifier).
• quadratic — fits multivariate normal densities with covariance estimates stratified by group.
• diagquadratic — similar to quadratic, but with a diagonal covariance matrix estimate (naive Bayes classifier).
• Classify a new point by evaluating its probability under each density function, and assigning it to the class with the highest probability
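A Python/scikit-learn sketch of the same family of models (my own mapping, using the iris data for illustration): pooled covariance corresponds to LDA, per-class covariance to QDA, and the diagonal-covariance variants to Gaussian naive Bayes.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# linear: pooled covariance; quadratic: covariance per class;
# diagonal covariance corresponds to Gaussian naive Bayes.
accs = {}
for name, model in [("linear", LinearDiscriminantAnalysis()),
                    ("quadratic", QuadraticDiscriminantAnalysis()),
                    ("diag (naive Bayes)", GaussianNB())]:
    accs[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(accs[name], 3))
```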
49. Interpreting Discriminant Analyses
• Visualise the posterior probability surfaces
>> [XI,YI] = meshgrid(linspace(4,8), linspace(2,4.5));
>> X = XI(:); Y = YI(:);
>> [class,err,P] = classify([X Y], meas(:,1:2), species, 'quadratic');
>> for i=1:3
     ZI = reshape(P(:,i),100,100);
     surf(XI,YI,ZI,'EdgeColor','none');
     hold on;
   end
50. Interpreting Discriminant Analyses
• Visualise the probability density of sample observations
• An indicator of the region in which the model has support from training data
>> [XI,YI] = meshgrid(linspace(4,8), linspace(2,4.5));
>> X = XI(:); Y = YI(:);
>> [class,err,P,logp] = classify([X Y], meas(:,1:2), species, 'quadratic');
>> ZI = reshape(logp,100,100);
>> surf(XI,YI,ZI,'EdgeColor','none');
51. Classifying with K-Nearest Neighbours – what does it do?
• One of the simplest classifiers – a sample is classified by taking the K nearest points from the training set, and choosing the majority class of those K points
• There is no real training phase – all the work is done during the application of the model
>> classes = knnclassify(sample,training,group,K)
[Scatter plot of the classified points (group1–group8) in the (x1, x2) plane]
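In Python/scikit-learn the same classifier is `KNeighborsClassifier` (iris data used here purely for illustration); note that `fit` only stores the training set, matching the "no real training phase" point:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Majority vote among the 5 nearest training points; the work happens
# at prediction time, not at fit time.
knn = KNeighborsClassifier(n_neighbors=5)
knn_acc = cross_val_score(knn, X, y, cv=5).mean()
print(round(knn_acc, 3))
```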
52. Decision Trees – how do they work?
• A threshold value for a variable partitions the dataset
• Thresholds are found for all predictors
• The resulting model is a tree where each node is a logical test on a predictor (var1 < thresh1, var2 > thresh2)
53. Decision Trees – how do I build them?
• Build the tree model
>> tree = classregtree(x,y);
>> view(tree)
• Evaluate the model on new data
>> tree(x_new)
[Scatter plot of the decision-tree class regions (group1–group8) in the (x1, x2) plane]
54. Enhancing the model: bagged trees
• Prune the decision tree
>> [cost,secost,ntnodes,bestlevel] = test(t, 'test', x, y);
>> topt = prune(t, 'level', bestlevel);
• Bootstrap-aggregated tree forest
>> forest = TreeBagger(100, x, y);
>> y_pred = predict(forest,x);
• Visualise class boundaries as before
[Scatter plot of the bagged-tree class regions (group1–group8) in the (x1, x2) plane]
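A Python/scikit-learn sketch of the same two enhancements (the dataset and the `ccp_alpha` value are my own illustration): cost-complexity pruning plays the role of `prune`, and a random forest of 100 trees is the analogue of `TreeBagger(100, x, y)`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# Single tree, pruned via cost-complexity pruning.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0)
# Bagged forest of 100 trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)

tree_acc = cross_val_score(tree, X, y, cv=5).mean()
forest_acc = cross_val_score(forest, X, y, cv=5).mean()
print("pruned tree:", round(tree_acc, 3))
print("forest     :", round(forest_acc, 3))
```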
55. Pattern Recognition Neural Networks – what are they?
• Two-layer (i.e. one-hidden-layer) feedforward neural networks can learn any input-output relationship given enough neurons in the hidden layer
• No restrictions on the predictors
56. Pattern Recognition Neural Networks – how do I build them?
• Build a neural network model
>> net = patternnet(10);
• Train the net to classify observations
>> [net,tr] = train(net,x,y);
• Apply the model to new data
>> y_pred = net(x);
[Scatter plot of the classified points (classes 1–8) in the (x1, x2) plane]
57. Support Vector Machines – what are they?
• The SVM algorithm finds a boundary between the classes that maximises the minimum distance of the boundary to any of the points
• No restrictions on the predictors
• Use one-vs-all to classify multiple classes
58. Support Vector Machines – how do I build them?
• Build an SVM model
>> svmmodel = svmtrain(x,y)
• Try different kernel functions
>> svmmodel = svmtrain(x,y,'kernel_function','rbf')
• Apply the model to new data
>> classes = svmclassify(svmmodel,x_new);
[Plot of the training points with the support vectors highlighted]
59. Evaluating a Classifying Model
• Three main strategies:
• Resubstitution – test the model on the same data that you trained it with
• Cross-validation
• Holdout – test on a completely new dataset
• Use cross-validation to evaluate model parameters such as the number of leaves for a tree or the number of hidden neurons
Apply cross-validation to your classifying model:
>> cp = cvpartition(y,'k',10);
>> ldaFun = @(xtrain,ytrain,xtest)(classify(xtest,xtrain,ytrain));
>> ldaCVErr = crossval('mcr',x,y,'predfun',ldaFun,'partition',cp)
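The three strategies side by side in Python/scikit-learn (iris and LDA chosen purely for illustration); `StratifiedKFold(10)` plays the role of `cvpartition(y,'k',10)`:

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)

X, y = load_iris(return_X_y=True)
model = LinearDiscriminantAnalysis()

# Resubstitution: optimistic, since we test on the training data.
resub = model.fit(X, y).score(X, y)

# 10-fold stratified cross-validation: misclassification rate, as with 'mcr'.
cv_err = 1 - cross_val_score(model, X, y, cv=StratifiedKFold(10)).mean()

# Holdout: keep a completely unseen 30% test set.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
holdout = LinearDiscriminantAnalysis().fit(X_tr, y_tr).score(X_te, y_te)

print(round(resub, 3), round(cv_err, 3), round(holdout, 3))
```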
60. Summary: Classification algorithms
No absolute best method.
Simple does not mean inefficient.
Decision trees and neural networks can overfit the noise: use bootstrapping and cross-validation.
Parallelize.
61. Regression
Why regression? Can be used to:
o Learn to model a continuous response from observations
o Predict the response for new observations
• Means:
• Linear regressions
• Non-linear regressions
• Bootstrapped regression trees
• Neural networks as fitting tools
62. New data set with a continuous response from one predictor
• A non-linear function to fit
• A continuous response to fit from one continuous predictor
>> [x,t] = simplefit_dataset;
[Plot of the simplefit dataset: target vs. input over [0, 10]]
63. Linear Regression – what is it?
• A collection of methods that find the best coefficients b such that y ≈ X*b
• Best b means minimising the least-squares difference between the predicted and actual values of y
• 'Linear' means linear in b – you can include extra variables to give a nonlinear relationship in X
64. Linear Regression – how do I do it?
• Least squares via the backslash operator
>> b = X\y
• Linear regression
>> b = regress(y, [ones(size(X,1),1) X])
>> stats = regstats(y, [ones(size(X,1),1) X])
• Robust regression – better in the presence of outliers
>> robust_b = robustfit(X,y) % NB (X,y) not (y,X)
• Ridge regression – better if data is close to collinear
>> ridge_b = ridge(y,X,k) % k is the ridge parameter
• Apply the model to new data
>> y = newdata*b;
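The three variants map onto Python/scikit-learn as follows (the synthetic data and true coefficients are my own illustration; `HuberRegressor` is one robust-regression choice, standing in for `robustfit`):

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=100)

# Ordinary least squares: the equivalent of b = X\y (with an intercept).
ols = LinearRegression().fit(X, y)
# Robust regression: down-weights outliers.
rob = HuberRegressor().fit(X, y)
# Ridge regression: L2 penalty, better for near-collinear data.
rid = Ridge(alpha=1.0).fit(X, y)

y_pred = ols.predict(X)   # apply the model to data
print(np.round(ols.coef_, 2))
```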
65. Interpreting a linear regression model
• Examine coefficients to see which predictors have a large effect on the response
>> [b,bint,r,rint,stats] = regress(y,X)
>> errorbar(1:size(b,1), b, b-bint(:,1), bint(:,2)-b)
• Examine residuals to check for possible outliers
>> rcoplot(r,rint)
• Examine the R² statistic and p-value to check overall model significance
>> stats(1)*100 % R² as a percentage
>> stats(3) % p-value
• Additional diagnostics with regstats
67. Fit Neural Networks – what are they?
• Fitting networks are feedforward neural networks used to fit an input-output relationship
• This architecture can learn any input-output relationship given enough neurons
• No restrictions on the predictors (categorical, ordinal, discontinuous)
68. Fit Neural Networks – how do I build them?
• Build a fit neural net model
>> net = fitnet(10);
• Train the net to fit the target
>> [net,tr] = train(net,x,t);
• Apply the model to new data
>> y_pred = net(x);
[Function fit for output element 1: outputs vs. targets, with the fit errors (targets − outputs) plotted below]
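A Python/scikit-learn sketch of the same fitting network: `MLPRegressor` with one hidden layer of 10 neurons is the rough counterpart of `fitnet(10)`. The target function stands in for `simplefit_dataset` and is my own illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# A smooth nonlinear input-output relationship to fit.
x = np.linspace(0, 10, 200).reshape(-1, 1)
t = np.sin(x).ravel() + 0.5 * x.ravel()

# One hidden layer of 10 neurons; lbfgs suits small fitting problems.
net = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                   solver="lbfgs", max_iter=5000, random_state=0).fit(x, t)

y_pred = net.predict(x)   # apply the model to data
print("R^2 on training data:", round(net.score(x, t), 3))
```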
69. Regression trees – what are they?
• A decision tree with binary splits for regression. An object of class RegressionTree can predict responses for new data with the predict method.
• No restrictions on the predictors (categorical, ordinal, discontinuous)
70. Regression trees – how do I use them?
• Build a regression tree model
>> rtree = RegressionTree.fit(x,t);
• Predict the response on the training data
>> y_tree = predict(rtree,x);
• Apply the model to new data
>> y_pred = predict(rtree,x_new);
[Plot of the regression-tree fit to the data, with near-zero residuals]
71. Summary
Data Mining
• Exploration
  – Univariate: pie chart, histogram, etc.
  – Multivariate: feature selection and transformation
• Modelling
  – Clustering: partitive (K-means, Gaussian mixture model, SOM), hierarchical
  – Classification: discriminant, decision tree, neural network, support vector machine
  – Regression