This document provides an outline for a machine learning syllabus. It includes 14 modules covering topics like machine learning terminology, supervised and unsupervised learning algorithms, optimization techniques, and projects. It lists software and hardware requirements for the course. It also discusses machine learning applications, issues, and the steps to build a machine learning model.
1. SHIWANI GUPTA
MACHINE LEARNING
2. SYLLABUS
Introduction to Machine Learning (1) 6
Machine Learning terminology, Types of Machine Learning, Issues in Machine Learning, Application of Machine Learning, Steps in developing ML application, How to choose the right algorithm
Data Preprocessing (3) 10
Data Cleaning (missing value, outlier), Exploratory Data Analysis (descriptive statistics, Visualization), Feature Engineering (Data Transformation (encoding, skew, scale), Feature selection)
Supervised Learning with Regression (1) 5
Simple Linear, Multiple Linear, Polynomial, Overfit/Underfit, Regularization, Evaluation Metric, Use case
Supervised Learning with Classification (3) 12
k Nearest Neighbor, Logistic Regression, Linear SVM, Kernels, Decision Tree (CART), Issues in DT learning, Ensembles (Bagging – Random Forest, Boosting – Gradient Boost), Evaluation metric, Use case
Optimization Techniques (2) 6
Model Selection techniques (Cross Validation), Gradient Descent Algorithm, Grid Search method, Model Evaluation technique (Bias, Variance)
Unsupervised Learning with clustering and Reinforcement Learning (2) 6
k Means algorithm, Dimensionality Reduction, Use case, Elements of Reinforcement Learning, Temporal Difference Learning, Online Learning, Use case
3. MODULE 1 (6 HOURS)
• Machine Learning terminology
• Types of Machine Learning
• Issues in Machine Learning
• Application of Machine Learning
• Steps in developing ML application
• How to choose the right algorithm
4. S/W AND H/W REQUIREMENT
16+ GB RAM, 4+ CORES, SSD storage, Amazon AWS, MS Azure, Google cloud
Python Data Science S/W stack (pip, conda)
NumPy – Linear Algebra
Pandas – Data read / process
Scikit-Learn – ML algo
Matplotlib – Visualization
Seaborn – more aesthetically pleasing
Plotly – interactive visualization library
t-SNE – high-dimensional visualization
statsmodels – statistical models
SciPy – optimization
Tkinter – GUI lib for python
PyTorch – open source framework
Keras – high level API and open source framework
TensorFlow - open source framework
Theano – multidim array manipulation
NLTK – human language data
BeautifulSoup – navigating webpage
Bokeh – interactive visualizations
TextBlob – process textual data
SHAP – SHapley Additive exPlanations
xAI – eXplainable AI
• IDE – Spyder, Jupyter notebook, PyCharm, Google Colab
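A quick way to confirm that part of this stack is installed is sketched below. It is only an illustration: the import names in `PACKAGES` are assumptions about how a few of the listed libraries are imported (for example, Scikit-Learn is imported as `sklearn`), and `check_packages` is a hypothetical helper, not part of any library.

```python
import importlib.util

# Import names for a few of the libraries listed above (assumed spellings).
PACKAGES = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn", "scipy"]


def check_packages(names):
    """Return {import_name: True/False} depending on whether it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_packages(PACKAGES).items():
        print(f"{name:12s} {'installed' if found else 'missing'}")
```

Anything reported as missing can then be installed with pip or conda, as noted above.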
PROJECT
7. PREREQUISITES
• Probability and Statistics (random variables, probability distributions, statistics: mean, median, mode, variance, standard deviation, covariance, Bayes' theorem, entropy)
• Linear Algebra (matrices, vectors, tensors, eigenvalues, eigenvectors)
• Calculus (functions, derivatives of single-variable and multivariate functions)
• Python language
• Structured thinking, communication and problem solving
• Business understanding
8. WHY IS ML GETTING ATTENTION RECENTLY
This development is driven by a few underlying forces:
• The amount of data generated is increasing significantly as the cost of sensors falls
• The cost of storing this data has reduced significantly
• The cost of computing has come down significantly
• Cloud has democratized compute for the masses
FUTURE
9. ML VS AUTOMATION
• If you think that machine learning is nothing but a new name for automation, you would be wrong. Most of the automation of the last few decades has been rule-driven. For example, automating flows in our mailbox requires us to define the rules, and these rules act in the same manner every time.
• Machine learning, on the other hand, helps machines learn from past data and change their decisions and performance accordingly. Spam detection in our mailboxes is driven by machine learning. Hence, it continues to evolve with time.
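The contrast can be sketched in a few lines. This is a toy illustration, not a real spam filter: the keyword rule, the four example messages, and the word-count scoring are all made up for the purpose of showing a fixed rule next to behaviour derived from labelled data.

```python
from collections import Counter


def rule_based_is_spam(message):
    # Hand-written rule: it behaves the same way until someone edits it.
    return "lottery" in message.lower()


def train_word_counts(examples):
    """Count how often each word appears in spam and in non-spam messages."""
    spam, ham = Counter(), Counter()
    for text, is_spam in examples:
        (spam if is_spam else ham).update(text.lower().split())
    return spam, ham


def learned_is_spam(message, spam, ham):
    # +1 for each word seen more often in spam, -1 for each seen more in non-spam.
    words = message.lower().split()
    score = sum((spam[w] > ham[w]) - (ham[w] > spam[w]) for w in words)
    return score > 0


# Hypothetical labelled data: (message, is_spam)
examples = [
    ("win a free prize now", True),
    ("free prize claim today", True),
    ("meeting notes for today", False),
    ("lunch at noon", False),
]
spam, ham = train_word_counts(examples)

print(rule_based_is_spam("claim your free prize"))
print(learned_is_spam("claim your free prize", spam, ham))
```

Adding new labelled messages and retraining changes what the learned filter flags, whereas the rule stays fixed until someone rewrites it.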
PROJECT
10. DEFINITION
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” - Tom Mitchell
“Machine learning enables a machine to automatically learn from data, improve performance from experiences, and predict things without being explicitly programmed.”
“Machine learning is a subfield of artificial intelligence, which enables machines to learn from past data or experiences without being explicitly programmed.”
“Science of getting computers to act without explicit programming” - Arthur Samuel
EXAM
11. SCIENCE OF TEACHING MACHINES HOW TO LEARN BY SELF
E.g. the task of mopping and cleaning the floor.
• When a human does the task, the quality of the outcome varies. The human gets exhausted or bored after a few hours of work and also gets sick at times. Depending on the place, the task could also be hazardous or risky for a human.
• Machines can do high-frequency repetitive tasks with high accuracy without getting tired. If we can also teach machines to detect whether the floor needs cleaning and mopping, and how much cleaning is required based on the condition and type of the floor, machines would be far better at doing the same job. They can go on doing that job without getting tired or sick!
• This is what Machine Learning aims to do: enable machines to learn on their own.
To answer questions like:
• Does the floor need cleaning and mopping?
• How long does the floor need to be cleaned?
machines need a way to think, and this is precisely where machine learning models help. The machines capture data from the environment and feed it to the machine learning model. The model then uses this data to predict whether the floor needs cleaning and, if so, for how long.
12. HOW DO MACHINES LEARN
• Tasks difficult for humans can be very simple for machines, e.g. multiplying very large numbers.
• Tasks which look simple to humans can be very difficult for machines!
• You only need to demonstrate cleaning and mopping to a human a few times before they can perform it on their own.
• But that is not the case with machines. We need to collect a lot of data along with the desired outcomes in order to teach machines to perform specific tasks.
• This is where machine learning comes into play. Machine Learning would help the machine understand the kind of cleaning, the intensity of cleaning, and the duration of cleaning based on the conditions and nature of the floor.
13. TOOLS
Language
• R
• Python
• SAS
• Julia
• Java
• Scala
Database
• SQL
• Oracle
• Hadoop
Visualisation
• D3.js
• Tableau
• QlikView
FUTURE
14. TERMINOLOGY
• Dataset (training, validation, testing)
• .csv file
• Structured vs unstructured data
• predictor, target, explanatory, independent, dependent, response variable
• Instance
• Features (numerical, discrete, categorical, ordinal, nominal)
• Model
• Hypothesis
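The first few terms can be made concrete with a small sketch. Each instance below is a (features, target) pair, and the dataset is shuffled and cut into training, validation and testing parts. The 60/20/20 proportions, the toy data and the `split_dataset` helper are assumptions chosen for illustration only.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle the instances and split them into train / validation / test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = round(len(rows) * train)
    n_val = round(len(rows) * validation)
    return (
        rows[:n_train],
        rows[n_train:n_train + n_val],
        rows[n_train + n_val:],
    )


# Toy dataset: one numerical feature (the predictor) and a target equal to twice its value.
dataset = [((x,), 2 * x) for x in range(10)]
train_set, validation_set, test_set = split_dataset(dataset)
print(len(train_set), len(validation_set), len(test_set))
```

The model is fitted on the training part, tuned on the validation part, and judged once on the testing part.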
PROJECT
17. TYPES
• Supervised Learning – labelled (binary and multi-class)
• Classification – discrete response, e.g. LoR, NB, kNN, SVM, DT, RF, GBM, XGB, NN
E.g. spam filtering, waste classification
• Regression – continuous response, e.g. LR, SVR, DTR, RFR
E.g. changes in temperature, stock price prediction
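The difference between the two response types can be sketched without any library: a k-nearest-neighbour vote returns a discrete label, while a least-squares line returns a continuous value. Both functions and the toy data are illustrative assumptions, not the course's reference implementations.

```python
from collections import Counter


def knn_classify(train, query, k=3):
    """train: list of (feature_tuple, label). Return the majority label of the k nearest."""
    by_distance = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


# Classification: labelled points with a discrete response.
labelled = [
    ((1, 1), "ham"), ((1, 2), "ham"), ((2, 1), "ham"),
    ((8, 8), "spam"), ((8, 9), "spam"), ((9, 8), "spam"),
]
print(knn_classify(labelled, (2, 2)))

# Regression: a continuous response.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```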
EXAM
18. TYPES
• Unsupervised Learning – unlabelled
• Clustering, e.g. k-means, hierarchical, NN
E.g. customer segmentation, city planning, cell phone tower for optimal signal reception
• Association, e.g. Apriori
E.g. diaper and beer, bread and milk
• Dimensionality Reduction, e.g. PCA, SVD
E.g. MNIST data (70000 × 784), face recognition (698 × 4096)
• Anomaly Detection, e.g. kNN, k-means
E.g. fraud detection, fault detection, outlier detection
• Semi-supervised learning
E.g. speech analysis, web content classification, Google Expander
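As a minimal sketch of clustering on unlabelled data, the loop below runs k-means on one-dimensional points: assign every point to its nearest centroid, then move each centroid to the mean of its points. The points, the starting centroids and the fixed iteration count are assumptions for illustration.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on numbers: alternate the assignment and update steps."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1, 2, 3, 10, 11, 12]
centroids, clusters = k_means_1d(points, [0.0, 5.0])
print(centroids, clusters)
```

No labels are supplied anywhere; the two groups emerge from the distances alone.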
EXAM
19. TYPES
• Reinforcement Learning – maximise cumulative reward, e.g. Q-Learning, SARSA, DQN
E.g. robotic dog, Tic Tac Toe
• Neural Network, e.g. recognise dog
• Deep Learning, e.g. chat bot, real time bidding, recommender system
• Natural Language Processing, e.g. Lemmatisation, Stemming
E.g. customer service complaints, virtual assistant
• Computer Vision, e.g. Canny edge detection, Haar Cascade classifier
E.g. skin cancer diagnosis, detect real time traffic, guided surgery
• Evolutionary Learning (GA, Optimisation algorithms)
E.g. Super Mario
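To make "maximise cumulative reward" concrete, here is a minimal tabular Q-learning sketch. The environment is a made-up five-cell corridor in which only reaching the last cell gives a reward; the learning rate, discount factor, episode count and purely random exploration are all assumptions chosen to keep the example short.

```python
import random

N_STATES = 5            # cells 0..4 on a line; cell 4 is the goal
ACTIONS = (0, 1)        # 0 = move left, 1 = move right
ALPHA, GAMMA = 0.5, 0.9  # learning rate and discount factor (assumed values)


def step(state, action):
    """Move one cell; reaching the goal gives reward 1, everything else 0."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    return nxt, (1.0 if nxt == N_STATES - 1 else 0.0)


def q_learning(episodes=500, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        while state != N_STATES - 1:
            action = rng.choice(ACTIONS)  # explore with random actions
            nxt, reward = step(state, action)
            # Q-learning update: move towards reward + discounted best next value.
            target = reward + GAMMA * max(q[nxt])
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q


q = q_learning()
# Greedy policy: in each non-goal cell pick the action with the larger Q-value.
policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

Although the agent only ever acts randomly while learning, the greedy policy read from the table heads straight for the goal, because the update bootstraps from the best next action rather than the one actually taken.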
EXAM
20. ISSUES IN MACHINE LEARNING
• What are the existing algorithms for learning?
• When will an algorithm converge?
• Which algorithms perform best for which kinds of problems?
• How much data is sufficient? E.g. training to classify cat and dog
• Non-representative training data, e.g. exit poll during elections
• Poor quality of data, e.g. outliers, missing values
• How many features are required? Irrelevant features
• Overfitting training data
• Underfitting training data
• Computation power? E.g. GPU and TPU for ML and DL
• Interpretability of model? E.g. why the bank declined a loan for a customer
• How to improve learning?
• Optimization vs Generalization?
• New and better algorithms required
• Need for more data scientists
EXAM
21. PROJECT IDEAS (40)
• Fraud detection
• Predict low oxygen level during surgery
• Recognise CVD factors
• Movie recommendation (Netflix)
• Marketing and Sales
• Weather prediction
• Traffic Prediction (Uber ATG)
• Loan defaulting prediction
• Handwriting recognition
• Sentiment analysis
• Human activity recognition
• Sports predictor
• Big Mart Sales prediction
• Fake news detection
• Disease prediction
• Stock market analysis
• Amazon Alexa
• Search Engine Optimization
• Auto-tagging and Friend suggestion (Facebook)
• Swiggy and Uber Eats
• House price prediction
• Market Analysis
• Handwritten digit recognition
• Equipment failure prediction
• Prospective insurance buyer
• Google News
• Video Surveillance
• Movie Ticket pricing system
• Object Detection
PROJECT
22. ML USE CASE IN SMARTPHONES
• From the voice assistant that sets your alarm and finds you the best restaurants to the simple use case of unlocking your phone via facial recognition, Machine Learning is truly embedded in our favourite devices.
• Voice Assistants
• Smartphone Cameras
• App Store and Play Store Recommendations
• Face Unlock
EXAM
23. ML USE CASE IN TRANSPORTATION
• The application of machine learning in the transport industry has gone to an entirely different level in the last decade. This coincides with the rise of ride-hailing apps like Uber, Lyft, Ola, etc. These companies use machine learning throughout their many products, from planning optimal routes to deciding prices for the rides we take. So, let’s look at a few popular use cases in transportation which use machine learning heavily.
• Dynamic Pricing in Travel
• Transporting and Commuting - Uber
• Google Maps
EXAM
24. ML USE CASE IN WEB SERVICES
• We interact with certain applications multiple times every day. What we perhaps did not realize until recently is that most of these applications work thanks to the power and flexibility of Machine Learning.
• Email Filtering
• Google Search
• Google Translate
• Facebook and LinkedIn Recommendations
EXAM
25. ML USE CASE IN SALES AND MARKETING
• Top companies in the world are using Machine Learning to transform their strategies from top to bottom. The two most impacted functions? Marketing and Sales!
• These days if you’re working in the Marketing and Sales field, you need to know at least one Business Intelligence tool (like Tableau or Power BI). Additionally, marketers are expected to know how to leverage Machine Learning in their day-to-day role to increase brand awareness, improve the bottom line, etc.
• Recommendation Engine
• Personalized Marketing
• Customer Support (Chatbots)
EXAM
26. ML USE CASE IN FINANCIAL DOMAIN
• Most of the jobs in Machine Learning are geared towards the financial domain. And that makes sense! This is the ultimate numbers field. Until recently, a lot of banking institutions leaned on Logistic Regression (a simple machine learning algorithm) to crunch these numbers.
• Fraud Detection
• Personalized Banking
EXAM
27. STEPS IN BUILDING A ML APPLICATION
• Frame the business problem as an ML problem
• What is the main objective? What are we trying to predict?
• What are the target features?
• What is the input data? Is it available?
• What kind of problem are we facing? Binary classification? Clustering?
• What is the expected improvement?
• Define performance metric
• Regression problems use evaluation metrics such as Mean Squared Error (MSE).
• Classification problems use evaluation metrics such as Precision, Accuracy and Recall.
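The metrics named above follow directly from their definitions. The sketch below computes them by hand; the function names and the toy label vectors are illustrative, and treating label 1 as the positive class is an assumption.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def classification_scores(y_true, y_pred, positive=1):
    """Accuracy, precision and recall from the confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


mse = mean_squared_error([3, 5, 7], [2, 5, 9])
scores = classification_scores([1, 1, 1, 0, 0, 0, 0, 1], [1, 1, 0, 0, 0, 1, 0, 1])
print(mse, scores)
```

Precision asks how many predicted positives were right; recall asks how many actual positives were found. Which one matters more depends on the business problem framed in the first step.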
EXAM
28. STEPS IN BUILDING A ML APPLICATION
• Gathering Data
• RSS feed, web scraping, API
• Generating Hypothesis
• Can our outputs be predicted given the inputs?
• Is our available data informative enough to learn the relationship between the inputs and the outputs?
• Exploratory Data Analysis (Visualisation for outlier)
• Data Preparation and cleaning (Missing Value)
• Delete the affected features or samples
• Missing value imputation
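The two ways of handling missing values listed above can be sketched as follows. Missing entries are represented by `None`, and mean imputation is only one of several possible strategies; both helper names are hypothetical.

```python
def drop_missing(rows):
    """Deletion: keep only the samples that have no missing entry."""
    return [row for row in rows if None not in row]


def impute_mean(values):
    """Imputation: replace each missing entry with the mean of the known ones."""
    known = [v for v in values if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in values]


print(drop_missing([(1, 2), (None, 3), (4, 5)]))
print(impute_mean([1.0, None, 3.0]))
```

Deletion is simple but throws data away; imputation keeps every sample at the cost of inventing a plausible value.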
EXAM
29. STEPS IN BUILDING A ML APPLICATION
• Feature Engineering (Encoding, Transformation)
• Mapping Ordinal features
• Encoding Nominal class labels
• Normalization, Standardization
• Define benchmark / baseline model (kNN, NB)
• Choose model
• Train/build Model (train:validation:test)
• Shuffle for classification
• For weather prediction, stock price prediction etc. data should not be shuffled, as the sequence of data is a crucial feature.
• Evaluate Model for Optimal Hyperparameters (cross validation)
• Tune Model (Grid search, Randomized search)
• Model testing and Deployment for prediction
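The feature engineering steps at the top of this list can be sketched in plain Python. The size ordering, the colour values and the helper names are made up for illustration; one-hot encoding is shown as one common way to encode nominal values, and normalization here means min-max scaling to the range 0 to 1.

```python
import statistics

SIZE_ORDER = {"small": 0, "medium": 1, "large": 2}  # hypothetical ordinal feature


def map_ordinal(values, order):
    """Ordinal features keep their ranking, so map them to increasing integers."""
    return [order[v] for v in values]


def one_hot(values):
    """Nominal features have no ranking, so give each category its own 0/1 column."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values], categories


def normalize(values):
    """Min-max scaling to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


print(map_ordinal(["small", "large", "medium"], SIZE_ORDER))
print(one_hot(["red", "blue", "red"]))
print(normalize([10, 20, 30]))
print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```

In practice the scaling parameters (minimum, maximum, mean, standard deviation) should be computed on the training split only and then reused on the validation and test splits.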
EXAM
30. CHOICE OF RIGHT ALGORITHM
EXAM
31. STEPS FOR SELECTING RIGHT ML ALGO
• Understand your Data
• Type of data will decide algorithm
• Algo will decide no. of samples
Eg. NB will work with categorical data and is not sensitive to missing data
• Stats and Visualization to know your data
• Percentile helps to identify outlier, median to identify central tendency
• Box plot (outlier), Histogram (spread), Scatter plot (bivariate relationship)
• Clean data w.r.t. missing values
• Feature Engineering
• Encoding
• Feature creation
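The percentile and median checks mentioned above can be tried with the standard library alone. This sketch uses Tukey's 1.5 x IQR rule, which is the rule a box plot typically uses for its whiskers; the data values are invented for illustration, and statistics.quantiles requires Python 3.8 or later.

```python
from statistics import mean, median, quantiles

# Toy measurements, invented for illustration; 95 is a deliberate outlier.
values = [12, 14, 15, 15, 16, 17, 18, 19, 21, 95]

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Tukey's rule: flag anything more than 1.5 * IQR outside the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in values if v < low or v > high]

# The median barely moves because of the outlier, while the mean is pulled upwards.
print(outliers, median(values), mean(values))
```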
EXAM
32. STEPS FOR SELECTING RIGHT ML ALGO
• Categorize the problem
• By I/P (supervised, unsupervised)
• By O/P (regression, classification, clustering, anomaly detection)
• Understand constraints (data storage capacity, real time applications, fast learning)
• Look for available algorithm (business goals met?, preprocessing required?, accuracy?, explainability?, speed?, scalable?)
• Try each, assess and compare
• Optimize
• Evaluate performance
• Repeat if required
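The "try each, assess and compare" step can be sketched with k-fold cross-validation. The example below compares a majority-class baseline with a 1-nearest-neighbour classifier on a toy 1-D dataset; the data, the two candidate models and the fold scheme are all assumptions chosen to keep the sketch short.

```python
from collections import Counter

# Toy 1-D dataset, invented for illustration: (feature, label) pairs.
data = [(1.0, "a"), (1.2, "a"), (1.4, "a"), (1.6, "a"),
        (3.0, "b"), (3.2, "b"), (3.4, "b"), (3.6, "b")]


def majority_baseline(train):
    """Baseline: always predict the most frequent training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label


def nearest_neighbour(train):
    """1-NN: predict the label of the closest training point."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]


def cross_val_accuracy(fit, data, k=4):
    """Average accuracy over k folds; fold i holds out every k-th sample."""
    scores = []
    for i in range(k):
        test = data[i::k]
        train = [pair for j, pair in enumerate(data) if j % k != i]
        model = fit(train)
        scores.append(sum(model(x) == y for x, y in test) / len(test))
    return sum(scores) / k


results = {
    "baseline": cross_val_accuracy(majority_baseline, data),
    "1-NN": cross_val_accuracy(nearest_neighbour, data),
}
best = max(results, key=results.get)
print(results, best)
```

A candidate is only worth keeping if it clearly beats the baseline under the same evaluation protocol.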
EXAM
33. CHOICE OF MODEL (USE CASE)
• Linear Regression: unstable with redundant features
Eg. Sales prediction, Time for commuting
• Logistic Regression: not blackbox, works with correlated features
Eg. Fraud detection, Customer churn prediction
• Decision Tree: can handle outliers but can overfit and use a lot of memory
Eg. Bank loan defaulter, Investment decision
• SVM: memory intensive, hard to interpret and difficult to tune
Eg. Text classification, Handwritten character recognition
• NB: less training data required, low memory requirement, faster
Eg. Sentiment analysis, Recommender systems
• RF: works well with large data and high dimension
Eg. Predict loan defaulters, Predict patients for high risk
• NN: resource and memory intensive
Eg. Object Recognition, Natural Language Translation
• K-means: grouping when the groups are not known in advance (the no. of clusters k must still be chosen)
Eg. Customer Segmentation, Crime locality identification
• PCA: dimensionality reduction
Eg. MNIST digits
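As one concrete instance from the list above, the K-means entry can be sketched on 1-D data. This is a simplified illustration, not a production implementation: the spend values are invented, and the initial centroids are hand-picked here, whereas real implementations usually initialise them randomly.

```python
def k_means_1d(points, centroids, iterations=10):
    """Plain k-means on 1-D data; the number of clusters is len(centroids)."""
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical customer spend values, invented for illustration.
spend = [10, 12, 11, 50, 52, 49, 90, 95]
centres, groups = k_means_1d(spend, centroids=[0.0, 40.0, 100.0])
print(centres, groups)
```

The caller fixes the number of segments by supplying three starting centroids; the algorithm only decides which customer falls into which segment.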
PROJECT
34. CHOICE OF METRIC
• Regression
• Mean Squared Error (MSE), Root MSE (RMSE), R-squared (R2)
• Mean Absolute Error (MAE) if outliers are present
• Classification
• Accuracy, LogLoss, ROC-AUC, Precision, Recall
• Cohen's kappa score, Matthews correlation coefficient (MCC)
• Unsupervised
• Mutual Information
• Rand index
• Reinforcement Learning
• Dispersion across Time
• Risk across Time
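For the regression and classification entries above, the sketch below shows why MAE is suggested when outliers are present, and how MCC is computed from a confusion matrix. The function names and toy numbers are invented for illustration.

```python
from math import sqrt


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE and R-squared for paired targets and predictions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    total = sum((t - mean_true) ** 2 for t in y_true)
    return {"mse": mse, "rmse": sqrt(mse), "mae": mae,
            "r2": 1 - sum(e ** 2 for e in errors) / total}


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from the four cells of a binary confusion matrix."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy values, invented for illustration; the last prediction misses badly.
scores = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 8.0])
mcc = matthews_corrcoef(tp=4, tn=4, fp=1, fn=1)
print(scores, mcc)
```

Because errors are squared before averaging, the single large miss dominates MSE and RMSE, while MAE weights it linearly.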
PROJECT
35. PROJECT LAB ORIENTATION
Installing Anaconda and Python
Step-1: Download Anaconda Python: www.anaconda.com/distribution/
Step-2: Install Anaconda Python (Python 3.7 version): double-click the ".exe" file of Anaconda
Step-3: Open Anaconda Navigator: use Anaconda Navigator to launch a Python IDE such as Spyder or Jupyter Notebook
Step-4: Close the Spyder/Jupyter Notebook IDE.
https://colab.research.google.com
https://github.com
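After installation, a quick check like the one below can be run in Spyder, Jupyter Notebook or Colab to confirm the environment is usable. The package list is an assumption about what the labs will need, not something stated on the slide; edit it as required.

```python
import sys
from importlib.util import find_spec

# Package names are assumptions about what the course uses; edit as needed.
PACKAGES = ["numpy", "pandas", "sklearn", "matplotlib"]


def check_environment(packages=PACKAGES):
    """Report the Python version and which packages can be imported."""
    status = {name: find_spec(name) is not None for name in packages}
    return sys.version_info[:3], status


version, status = check_environment()
print("Python", ".".join(map(str, version)))
for name, available in status.items():
    print(f"{name}: {'found' if available else 'MISSING'}")
```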
PROJECT
36. PROJECT TASK LIST
Study tool for implementation
Project title and Course identification
Choose data (understand domain and data)
Perform EDA
Perform Feature Engineering
Choose model
Train and Validate model
Tune Hyperparameters
Test and Evaluate model
Prepare Report
Prepare Technical Paper
Present Case Study
PROJECT
37. EXPECTATIONS
Case Study Presentation
Mini Project
Technical Paper
Report
Competition (Inhouse, Online)
PROJECT
38. CASE STUDY TITLES (31)
MNIST
MS-COCO
ImageNet
CIFAR
IMDB Reviews
WordNet
Twitter Sentiment Analysis
Breast Cancer Wisconsin
BBC News
Wheat seeds
Amazon Reviews
Facial Image
Spam SMS
YouTube
Chars74K
Wine Quality
Iris Flowers
LabelMe
HotpotQA
Ionosphere
xView
US Census
Boston House Price
BankNote authentication
PIMA Indian Diabetes
BBC Sport
Titanic
Santander Product Recommendation
Sonar
Swedish Auto Insurance
Abalone
PROJECT
39. BOOKS AND DATASET RESOURCES
• https://www.kaggle.com/datasets
• https://archive.ics.uci.edu/ml/index.php
• https://registry.opendata.aws/
• https://toolbox.google.com/datasetsearch
• https://msropendata.com/
• https://github.com/awesomedata/awesome-public-datasets
• Indian Government dataset
• US Government Dataset
• Northern Ireland Public Sector Datasets
• European Union Open Data Portal
• https://scikit-learn.org/stable/datasets/index.html
• https://data.world
• http://archive.ics.uci.edu/ml/datasets
• https://www.ehdp.com/vitalnet/datasets.htm
• https://www.data.gov/health/
• “Python Machine Learning”, Sebastian Raschka, Packt Publishing
• “Machine Learning in Action”, Peter Harrington, DreamTech Press
• “Introduction to Machine Learning”, Ethem Alpaydın, MIT Press
• “Machine Learning”, Tom M. Mitchell, McGraw Hill
• “Machine Learning: An Algorithmic Perspective”, Stephen Marsland, CRC Press
• “Machine Learning: A Probabilistic Perspective”, Kevin P. Murphy, MIT Press
• “Pattern Recognition and Machine Learning”, Christopher M. Bishop, Springer
• “The Elements of Statistical Learning”, Trevor Hastie, Robert Tibshirani, Jerome Friedman, Springer
40. LEARNING RESOURCES
• https://www.analyticsvidhya.com
• https://towardsdatascience.com
• https://analyticsindiamag.com
• https://machinelearningmastery.com
• https://www.datacamp.com
• https://www.superdatascience.com
• https://www.elitedatascience.com
• https://medium.com
• Siraj Raval YouTube channel
• https://mlcontests.com
• https://www.datasciencechallenge.net
• https://www.machinehack.com
• https://www.hackerearth.com
• www.kaggle.com/competitions
• www.smartindiahackathon.gov.in
• www.datahack.analyticsvidhya.com
• www.daretocompete.com
• https://github.com
42. SUMMARY (SUMMATIVE ASSESSMENT)
• Examine the steps in developing a Machine Learning application with respect to your mini project. [10]
• Review the issues in Machine Learning. [10]
• State an applicable use case for each ML algorithm. [10]
• Examine Applications of AI. [10]
• Illustrate the steps for selecting the right ML algorithm. [10]
• Define ML and differentiate between Supervised, Unsupervised and Reinforcement learning with the help of suitable examples. [10]
• Explain ML w.r.t. identifying Tasks, Experience and Performance measure (Tom Mitchell). [10]
• Designing a checkers learning problem
• Designing a handwriting recognition learning problem
• Designing a robot driving learning problem
• Illustrate with an example how Supervised learning can be used in handling loan defaulters. [10]
• Explain Supervised Learning with a neat diagram. [10]
EXAM