This document provides an introduction to machine learning. It begins with an agenda that lists topics such as introduction, theory, top 10 algorithms, recommendations, classification with naive Bayes, linear regression, clustering, principal component analysis, MapReduce, and conclusion. It then discusses what big data is and how data is accumulating at tremendous rates from various sources, and explains the volume, variety, and velocity aspects of big data. The document also provides examples of machine learning applications and discusses extracting insights from data using various algorithms. It covers issues in machine learning such as overfitting and underfitting, and the importance of testing algorithms. The document concludes that machine learning has vast potential, but that realizing this potential is very difficult because it requires strong mathematics skills.
A half-day session on machine learning and its applications. It introduces artificial intelligence, then moves on to machine learning: applications, algorithms, types, using the cloud for ML, deep learning, and some resources to start with.
If there is one crucial thing in building ML models, it is data preparation: the process of transforming raw data into a state where machine learning algorithms can be run on it to disclose insights and make predictions. Data preparation involves analysis and depends on the nature of the problem and the particular algorithms. Because it relies on knowledge and experience, it cannot simply be automated, which makes the role of the data scientist the key to success.
ML is trendy, and Microsoft already has more than 10 services to support ML. So we will focus on tools like Azure ML Workbench and Python for data preparation, review some common tricks for approaching data, and experiment in Azure ML Studio.
Provides a brief overview of what machine learning is, how it works (theory), how to prepare data for a machine learning problem, an example case study, and additional resources.
Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.
An ensemble is itself a supervised learning algorithm, because it can be trained and then used to make predictions. The trained ensemble, therefore, represents a single hypothesis. This hypothesis, however, is not necessarily contained within the hypothesis space of the models from which it is built.
Ensemble Learning is a technique that creates multiple models and then combines them to produce improved results.
Ensemble learning usually produces more accurate solutions than a single model would.
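As a minimal sketch of the idea, the snippet below combines three base models by majority vote. The base models (`stump_a`, `stump_b`, `stump_c`), their 0.5 thresholds, and the helper names are all hypothetical and chosen only for illustration; real ensembles such as bagging or boosting also train their base models, which this sketch does not do.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the base models' predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Three hypothetical base models, each a one-feature threshold rule ("stump").
def stump_a(x):
    return 1 if x[0] > 0.5 else 0

def stump_b(x):
    return 1 if x[1] > 0.5 else 0

def stump_c(x):
    return 1 if x[2] > 0.5 else 0

BASE_MODELS = [stump_a, stump_b, stump_c]

def ensemble_predict(x):
    """Combine the base models into a single hypothesis by majority vote."""
    return majority_vote([model(x) for model in BASE_MODELS])

print(ensemble_predict((0.9, 0.9, 0.1)))  # two of three stumps vote 1
print(ensemble_predict((0.9, 0.1, 0.1)))  # two of three stumps vote 0
```

The combined rule ("at least two of the three features exceed 0.5") cannot be written as any single one-feature stump, which illustrates how an ensemble's hypothesis can lie outside the hypothesis space of its constituent models.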
In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data and recognize patterns, used for classification and regression analysis. The basic SVM takes a set of input data and predicts, for each given input, which of two possible classes forms the output, making it a non-probabilistic binary linear classifier.
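The binary linear decision rule described above can be sketched in a few lines. This is an illustrative simplification, not a production SVM solver: it assumes labels of +1/-1 and uses plain sub-gradient descent on the L2-regularized hinge loss, and the function names, toy data, and hyperparameters (`lr`, `lam`, `epochs`) are assumptions made for the example.

```python
def train_linear_svm(points, labels, lr=0.1, lam=0.01, epochs=20):
    """Sub-gradient descent on the L2-regularized hinge loss (labels are +1/-1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # Point is inside the margin or misclassified: move the boundary.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: only apply the regularization shrink.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Non-probabilistic binary decision: which side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (hypothetical).
points = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -3.0)]
labels = [1, 1, -1, -1]
w, b = train_linear_svm(points, labels)

print(predict(w, b, (1.0, 1.5)))    # falls on the positive side
print(predict(w, b, (-1.0, -0.5)))  # falls on the negative side
```

The output of `predict` is a class label only, with no probability attached, which is what "non-probabilistic" means here.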
Feature Engineering in Machine Learning
Knoldus Inc.
In this Knolx we explore data preprocessing and feature engineering techniques. We also look at what feature engineering is, why it is important in machine learning, and how it can help get the best results from the algorithms.
Machine Learning and Real-World Applications
MachinePulse
This presentation was created by Ajay, Machine Learning Scientist at MachinePulse, to present at a Meetup on Jan. 30, 2015. These slides provide an overview of widely used machine learning algorithms. The slides conclude with examples of real world applications.
Ajay Ramaseshan is a Machine Learning Scientist at MachinePulse. He holds a Bachelor's degree in Computer Science from NITK, Surathkal and a Master's in Machine Learning and Data Mining from Aalto University School of Science, Finland. He has extensive experience in the machine learning domain and has dealt with various real-world problems.
Machine Learning in 10 Minutes | What is Machine Learning? | Edureka
Edureka!
YouTube Link: https://youtu.be/qWHi09C3Dq0
Machine Learning Training with Python: https://www.edureka.co/machine-learning-certification-training
This Edureka video on 'Machine Learning in 10 Minutes' will help you understand what exactly Machine Learning is and what the different types of Machine Learning are, along with some career opportunities that Machine Learning opens up.
Example
What is AI?
What is Machine Learning?
Steps for Machine Learning
Types of Machine Learning
Supervised Learning
Unsupervised Learning
Reinforcement Learning
Applications of Machine Learning
What can you be with Machine Learning?
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
Castbox: https://castbox.fm/networks/505?country=in
Random Forest Algorithm - Random Forest Explained | Random Forest In Machine ...
Simplilearn
This Random Forest Algorithm presentation explains how the Random Forest algorithm works in Machine Learning. By the end of this video, you will understand what Machine Learning is, what a classification problem is, the applications of Random Forest, why we need Random Forest, how it works with simple examples, and how to implement the Random Forest algorithm in Python.
Below are the topics covered in this Machine Learning Presentation:
1. What is Machine Learning?
2. Applications of Random Forest
3. What is Classification?
4. Why Random Forest?
5. Random Forest and Decision Tree
6. Comparing Random Forest and Regression
7. Use case - Iris Flower Analysis
- - - - - - - -
About Simplilearn Machine Learning course:
A form of artificial intelligence, Machine Learning is revolutionizing the world of computing as well as all people’s digital interactions. Machine Learning powers such innovative automated technologies as recommendation engines, facial recognition, fraud protection and even self-driving cars. This Machine Learning course prepares engineers, data scientists and other professionals with the knowledge and hands-on skills required for certification and job competency in Machine Learning.
- - - - - - -
Why learn Machine Learning?
Machine Learning is taking over the world, and with that comes a growing need among companies for professionals who know the ins and outs of Machine Learning.
The Machine Learning market size is expected to grow from USD 1.03 Billion in 2016 to USD 8.81 Billion by 2022, at a Compound Annual Growth Rate (CAGR) of 44.1% during the forecast period.
- - - - - -
What skills will you learn from this Machine Learning course?
By the end of this Machine Learning course, you will be able to:
1. Master the concepts of supervised, unsupervised and reinforcement learning and modeling.
2. Gain practical mastery over principles, algorithms, and applications of Machine Learning through a hands-on approach which includes working on 28 projects and one capstone project.
3. Acquire thorough knowledge of the mathematical and heuristic aspects of Machine Learning.
4. Understand the concepts and operation of support vector machines, kernel SVM, naive Bayes, decision tree classifier, random forest classifier, logistic regression, K-nearest neighbors, K-means clustering and more.
5. Be able to model a wide variety of robust Machine Learning algorithms including deep learning, clustering, and recommendation systems
- - - - - - -
This Logistic Regression presentation will help you understand how a Logistic Regression algorithm works in Machine Learning. In this tutorial video, you will learn what Supervised Learning is, what a classification problem is and some associated algorithms, what Logistic Regression is, how it works with simple examples, the maths behind Logistic Regression, how it differs from Linear Regression, and where Logistic Regression is applied. At the end, you will also see an interesting demo in Python on how to predict the number present in an image using Logistic Regression.
The topics below are covered in this Machine Learning Algorithms Presentation:
1. What is supervised learning?
2. What is classification? What are some of its solutions?
3. What is logistic regression?
4. Comparing linear and logistic regression
5. Logistic regression applications
6. Use case - Predicting the number in an image
What is Machine Learning: Machine Learning is an application of Artificial Intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
Tutorial on Deep learning and Applications
NhatHai Phan
In this presentation, I would like to review basic techniques, models, and applications in deep learning. I hope you find the slides interesting. Further information about my research can be found at https://sites.google.com/site/ihaiphan/
NhatHai Phan
CIS Department,
University of Oregon, Eugene, OR
This talk is about how we applied deep learning techniques to achieve state-of-the-art results in various NLP tasks like sentiment analysis and aspect identification, and how we deployed these models at Flipkart.
Data By The People, For The People
Daniel Tunkelang
Director, Data Science at LinkedIn
Invited Talk at the 21st ACM International Conference on Information and Knowledge Management (CIKM 2012)
LinkedIn has a unique data collection: the 175M+ members who use LinkedIn are also the content those same members access using our information retrieval products. LinkedIn members performed over 4 billion professionally-oriented searches in 2011, most of those to find and discover other people. Every LinkedIn search and recommendation is deeply personalized, reflecting the user's current employment, career history, and professional network. In this talk, I will describe some of the challenges and opportunities that arise from working with this unique corpus. I will discuss work we are doing in the areas of relevance, recommendation, and reputation, as well as the ecosystem we have developed to incent people to provide the high-quality semi-structured profiles that make LinkedIn so useful.
Bio:
Daniel Tunkelang leads the data science team at LinkedIn, which analyzes terabytes of data to produce products and insights that serve LinkedIn's members. Prior to LinkedIn, Daniel led a local search quality team at Google. Daniel was a founding employee of faceted search pioneer Endeca (recently acquired by Oracle), where he spent ten years as Chief Scientist. He has authored fourteen patents, written a textbook on faceted search, created the annual workshop on human-computer interaction and information retrieval (HCIR), and participated in the premier research conferences on information retrieval, knowledge management, databases, and data mining (SIGIR, CIKM, SIGMOD, SIAM Data Mining). Daniel holds a PhD in Computer Science from CMU, as well as BS and MS degrees from MIT.
How To Interview a Data Scientist
Daniel Tunkelang
Presented at the O'Reilly Strata 2013 Conference
Video: https://www.youtube.com/watch?v=gUTuESHKbXI
Interviewing data scientists is hard. The tech press sporadically publishes “best” interview questions that are cringe-worthy.
At LinkedIn, we put a heavy emphasis on the ability to think through the problems we work on. For example, if someone claims expertise in machine learning, we ask them to apply it to one of our recommendation problems. And, when we test coding and algorithmic problem solving, we do it with real problems that we’ve faced in the course of our day jobs. In general, we try as hard as possible to make the interview process representative of actual work.
In this session, I’ll offer general principles and concrete examples of how to interview data scientists. I’ll also touch on the challenges of sourcing and closing top candidates.
Presentation given by Dr. Diego Kuonen, CStat PStat CSci, on November 20, 2013, at the "IBM Developer Days 2013" in Zurich, Switzerland.
ABSTRACT
There is no question that big data has hit the business, government and scientific sectors. The demand for skills in data science is unprecedented in sectors where value, competitiveness and efficiency are driven by data. However, there is plenty of misleading hype around the terms big data and data science. This presentation gives a professional statistician's view on these terms and illustrates the connection between data science and statistics.
The presentation is also available at http://www.statoo.com/BigDataDataScience/.
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Data Science London
What 'kind of things' does a data scientist do? What are the foundations and principles of data science? What is a Data Product? What does the data science process look like? Learning from data: Data Modeling or Algorithmic Modeling? Talk by Carlos Somohano @ds_ldn at The Cloud and Big Data: HDInsight on Azure, London, 25/01/13.
How to Become a Data Scientist
SF Data Science Meetup, June 30, 2014
Video of this talk is available here: https://www.youtube.com/watch?v=c52IOlnPw08
More information at: http://www.zipfianacademy.com
Zipfian Academy @ Crowdflower
Introduction to Mahout and Machine Learning
Varad Meru
This presentation gives an introduction to Apache Mahout and Machine Learning. It presents some of the important Machine Learning algorithms implemented in Mahout. Machine Learning is a vast subject; this presentation is only an introductory guide to Mahout and does not go into lower-level implementation details.
Myths and Mathemagical Superpowers of Data Scientists
David Pittman
Some people think data scientists are mythical beings, like unicorns, or they are some sort of nouveau fad that will quickly fade. Not true, says IBM big data evangelist James Kobielus. In this engaging presentation, with artwork created by Angela Tuminello, Kobielus debunks 10 myths about data scientists and their role in analytics and big data. You might also want to read the full blog by Kobielus that spawned this presentation: "Data Scientists: Myths and Mathemagical Superpowers" - http://ibm.co/PqF7Jn
For more information, visit http://www.ibmbigdatahub.com
Slides used for the keynote at the event Big Data & Data Science http://eventos.citius.usc.es/bigdata/
Some slides are borrowed from random hadoop/big data presentations
This presentation was prepared by one of our renowned tutors, "Suraj".
If you are interested in learning more about Big Data, Hadoop, and Data Science, join our free introduction class on 14 Jan at 11 AM GMT. To register your interest, email us at info@uplatz.com
Five Things I Learned While Building Anomaly Detection Tools - Toufic Boubez ...
tboubez
This is my presentation from LISA 2014 in Seattle on November 14, 2014.
Most IT Ops teams only keep an eye on a small fraction of the metrics they collect because analyzing this haystack of data and extracting signal from the noise is not easy and generates too many false positives.
In this talk I will show some of the types of anomalies commonly found in dynamic data center environments and discuss the top 5 things I learned while building algorithms to find them. You will see how various Gaussian based techniques work (and why they don’t!), and we will go into some non-parametric methods that you can use to great advantage.
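To make the contrast concrete, here is a small sketch of how a Gaussian-style z-score test can miss outliers that a median-based test catches. The talk's actual algorithms are not given here, so everything in the snippet is an assumption for illustration: the function names, the toy latency values, and the thresholds (3.0 for the z-score, 3.5 for the commonly used "modified z-score" based on the median absolute deviation).

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

def mad_outliers(values, threshold=3.5):
    """Flag values using the modified z-score built on the median and the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []
    # 0.6745 rescales the MAD so it is comparable to a standard deviation
    # when the data really is normally distributed.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical latency samples (ms) with two large spikes.
latencies = [10, 11, 10, 12, 11, 10, 11, 12, 10, 500, 480]

print(zscore_outliers(latencies))  # the spikes inflate the mean and stdev
print(mad_outliers(latencies))     # the median and MAD are barely affected
```

In this toy data the two spikes inflate the mean and standard deviation enough that neither spike exceeds three standard deviations, while the median-based test flags both.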
DataStax | Adversarial Modeling: Graph, ML, and Analytics for Identity Fraud ...
DataStax
Abstract from paper: Identity theft, and the resulting creation of synthetic identities for the purpose of committing fraud, pose a growing challenge to governments and businesses across the globe. This paper describes specific research into, and conclusions about, existing fraud detection data and supporting systems. It describes a novel ecosystem- and process-based approach, Adversarial Modeling, to combat what must be recognized as a complex, dynamic struggle against organized and efficient adversaries. Adversarial Modeling is a technology and process ecosystem based on distributed computing, graph theory, data mining and machine learning in a focused, purpose-designed, Agile-derived methodology.
About the Speaker
Rob Murphy
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
Mihai Criveti
Automate your Data Science pipeline with Ansible, Python and Kubernetes - ODSC Talk
What is Data Science and the Data Science Landscape
Process and Flow
Understanding Data
The Data Science Toolkit
The Big Data Challenge
Cloud Computing Solutions
The rise of DevOps in Data Science
Automate your data pipeline with Ansible
5 Things that Make Hadoop a Game Changer
Webinar by Elliott Cordo, Caserta Concepts
There is much hype and mystery surrounding Hadoop's role in analytic architecture. In this webinar, Elliott presented, in detail, the services and concepts that make Hadoop a truly unique solution - a game changer for the enterprise. He talked about the real benefits of a distributed file system, the multi-workload processing capabilities enabled by YARN, and the 3 other important things you need to know about Hadoop.
To access the recorded webinar, visit the event site: https://www.brighttalk.com/webcast/9061/131029
For more information on the services and solutions that Caserta Concepts offers, please visit http://casertaconcepts.com/
Data Science at Scale - The DevOps Approach
Mihai Criveti
DevOps Practices for Data Scientists and Engineers
1 Data Science Landscape
2 Process and Flow
3 The Data
4 Data Science Toolkit
5 Cloud Computing Solutions
6 The rise of DevOps
7 Reusable Assets and Practices
8 Skills Development
Similar to Introduction to Big Data/Machine Learning (20)
Schibsted collects and analyzes 900 million events/day using AWS. This presentation gives an overview of the systems and architecture, including the solutions to GDPR.
NoSQL databases were created to solve scalability problems with SQL databases. It turns out these problems are profoundly connected with Einstein's theory of relativity (no, honestly), and understanding this illuminates the SQL/NoSQL divide in surprising ways.
An overview of farmhouse brewing in Norway, both as it exists today, and as it was historically. Extra information on the unique Norwegian yeast cultures that still survive.
NoSQL databases, the CAP theorem, and the theory of relativity
Lars Marius Garshol
A presentation showing how the CAP theorem causes NoSQL databases to have BASE semantics. That is, they don't support ACID consistency. Then shows how CAP is related to Einstein's theory of relativity. And finally shows how Google Spanner and F1 provide ACID that scales.
UiPath Test Automation using UiPath Test Suite series, part 6
DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 6. In this session, we will cover Test Automation with generative AI and Open AI.
UiPath Test Automation with generative AI and Open AI webinar offers an in-depth exploration of leveraging cutting-edge technologies for test automation within the UiPath platform. Attendees will delve into the integration of generative AI, a test automation solution, with Open AI advanced natural language processing capabilities.
Throughout the session, participants will discover how this synergy empowers testers to automate repetitive tasks, enhance testing accuracy, and expedite the software testing life cycle. Topics covered include the seamless integration process, practical use cases, and the benefits of harnessing AI-driven automation for UiPath testing initiatives. By attending this webinar, testers, and automation professionals can gain valuable insights into harnessing the power of AI to optimize their test automation workflows within the UiPath ecosystem, ultimately driving efficiency and quality in software development processes.
What will you get from this session?
1. Insights into integrating generative AI.
2. Understanding how this integration enhances test automation within the UiPath platform
3. Practical demonstrations
4. Exploration of real-world use cases illustrating the benefits of AI-driven test automation for UiPath
Topics covered:
What is generative AI
Test Automation with generative AI and Open AI.
UiPath integration with generative AI
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex ProofsAlex Pruden
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Communications Mining Series - Zero to Hero - Session 1DianaGray10
This session provides introduction to UiPath Communication Mining, importance and platform overview. You will acquire a good understand of the phases in Communication Mining as we go over the platform with you. Topics covered:
• Communication Mining Overview
• Why is it important?
• How can it help today’s business and the benefits
• Phases in Communication Mining
• Demo on Platform overview
• Q/A
GridMate - End to end testing is a critical piece to ensure quality and avoid...ThomasParaiso2
End to end testing is a critical piece to ensure quality and avoid regressions. In this session, we share our journey building an E2E testing pipeline for GridMate components (LWC and Aura) using Cypress, JSForce, FakerJS…
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...Neo4j
Leonard Jayamohan, Partner & Generative AI Lead, Deloitte
This keynote will reveal how Deloitte leverages Neo4j’s graph power for groundbreaking digital twin solutions, achieving a staggering 100x performance boost. Discover the essential role knowledge graphs play in successful generative AI implementations. Plus, get an exclusive look at an innovative Neo4j + Generative AI solution Deloitte is developing in-house.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...SOFTTECHHUB
The choice of an operating system plays a pivotal role in shaping our computing experience. For decades, Microsoft's Windows has dominated the market, offering a familiar and widely adopted platform for personal and professional use. However, as technological advancements continue to push the boundaries of innovation, alternative operating systems have emerged, challenging the status quo and offering users a fresh perspective on computing.
One such alternative that has garnered significant attention and acclaim is Nitrux Linux 3.5.0, a sleek, powerful, and user-friendly Linux distribution that promises to redefine the way we interact with our devices. With its focus on performance, security, and customization, Nitrux Linux presents a compelling case for those seeking to break free from the constraints of proprietary software and embrace the freedom and flexibility of open-source computing.
In the rapidly evolving landscape of technologies, XML continues to play a vital role in structuring, storing, and transporting data across diverse systems. The recent advancements in artificial intelligence (AI) present new methodologies for enhancing XML development workflows, introducing efficiency, automation, and intelligent capabilities. This presentation will outline the scope and perspective of utilizing AI in XML development. The potential benefits and the possible pitfalls will be highlighted, providing a balanced view of the subject.
We will explore the capabilities of AI in understanding XML markup languages and autonomously creating structured XML content. Additionally, we will examine the capacity of AI to enrich plain text with appropriate XML markup. Practical examples and methodological guidelines will be provided to elucidate how AI can be effectively prompted to interpret and generate accurate XML markup.
Further emphasis will be placed on the role of AI in developing XSLT, or schemas such as XSD and Schematron. We will address the techniques and strategies adopted to create prompts for generating code, explaining code, or refactoring the code, and the results achieved.
The discussion will extend to how AI can be used to transform XML content. In particular, the focus will be on the use of AI XPath extension functions in XSLT, Schematron, Schematron Quick Fixes, or for XML content refactoring.
The presentation aims to deliver a comprehensive overview of AI usage in XML development, providing attendees with the necessary knowledge to make informed decisions. Whether you’re at the early stages of adopting AI or considering integrating it in advanced XML development, this presentation will cover all levels of expertise.
By highlighting the potential advantages and challenges of integrating AI with XML development tools and languages, the presentation seeks to inspire thoughtful conversation around the future of XML development. We’ll not only delve into the technical aspects of AI-powered XML development but also discuss practical implications and possible future directions.
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfPaige Cruz
Monitoring and observability aren’t traditionally found in software curriculums and many of us cobble this knowledge together from whatever vendor or ecosystem we were first introduced to and whatever is a part of your current company’s observability stack.
While the dev and ops silo continues to crumble….many organizations still relegate monitoring & observability as the purview of ops, infra and SRE teams. This is a mistake - achieving a highly observable system requires collaboration up and down the stack.
I, a former op, would like to extend an invitation to all application developers to join the observability party will share these foundational concepts to build on:
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Albert Hoitingh
In this session I delve into the encryption technology used in Microsoft 365 and Microsoft Purview. Including the concepts of Customer Key and Double Key Encryption.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024Neo4j
Neha Bajwa, Vice President of Product Marketing, Neo4j
Join us as we explore breakthrough innovations enabled by interconnected data and AI. Discover firsthand how organizations use relationships in data to uncover contextual insights and solve our most pressing challenges – from optimizing supply chains, detecting fraud, and improving customer experiences to accelerating drug discoveries.
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Introduction to Big Data/Machine Learning
1. Introduction to Machine Learning
2012-05-15
Lars Marius Garshol, larsga@bouvet.no, http://twitter.com/larsga
2. Agenda
• Introduction
• Theory
• Top 10 algorithms
• Recommendations
• Classification with naïve Bayes
• Linear regression
• Clustering
• Principal Component Analysis
• MapReduce
• Conclusion
3. The code
• I’ve put the Python source code for the
examples on GitHub
• Can be found at
– https://github.com/larsga/py-snippets/tree/master/machine-learning/
7. What is big data?
“Big Data is any thing which is crash Excel.”
“Small Data is when is fit in RAM. Big Data is when is crash because is not fit in RAM.”
Or, in other words, Big Data is data in volumes too great to process by traditional methods.
https://twitter.com/devops_borat
8. Data accumulation
• Today, data is accumulating at tremendous
rates
– click streams from web visitors
– supermarket transactions
– sensor readings
– video camera footage
– GPS trails
– social media interactions
– ...
• It really is becoming a challenge to store
and process it all in a meaningful way
9. From WWW to VVV
• Volume
– data volumes are becoming unmanageable
• Variety
– data complexity is growing
– more types of data captured than previously
• Velocity
– some data is arriving so rapidly that it must either
be processed instantly, or lost
– this is a whole subfield called “stream processing”
10. The promise of Big Data
• Data contains information of great
business value
• If you can extract those insights you can
make far better decisions
• ...but is data really that valuable?
13.
“quadrupling the average cow's milk production since your parents were born”
"When Freddie [as he is known] had no daughter records our equations predicted from his DNA that he would be the best bull," USDA research geneticist Paul VanRaden emailed me with a detectable hint of pride. "Now he is the best progeny tested bull (as predicted)."
14. Some more examples
• Sports
– basketball increasingly driven by data analytics
– soccer beginning to follow
• Entertainment
– House of Cards designed based on data analysis
– increasing use of similar tools in Hollywood
• “Visa Says Big Data Identifies Billions of
Dollars in Fraud”
– new Big Data analytics platform on Hadoop
• “Facebook is about to launch Big Data
play”
– starting to connect Facebook with real life
https://delicious.com/larsbot/big-data
15. Ok, ok, but ... does it apply to our
customers?
• Norwegian Food Safety Authority
– accumulates data on all farm animals
– birth, death, movements, medication, samples, ...
• Hafslund
– time series from hydroelectric dams, power prices,
meters of individual customers, ...
• Social Security Administration
– data on individual cases, actions taken, outcomes...
• Statoil
– massive amounts of data from oil exploration,
operations, logistics, engineering, ...
• Retailers
– see Target example above
– also, connection between what people buy, weather
forecast, logistics, ...
16. How to extract insight from data?
[Figure: Monthly Retail Sales in New South Wales (NSW) Retail Department Stores]
18. Basically, it’s all maths...
• Linear algebra
• Calculus
• Probability theory
• Graph theory
• ...
“Only 10% in devops are know how of work with Big Data. Only 1% are realize they are need 2 Big Data for fault tolerance”
https://twitter.com/devops_borat
19. Big data skills gap
• Hardly anyone knows this stuff
• It’s a big field, with lots and lots of theory
• And it’s all maths, so it’s tricky to learn
http://wikibon.org/wiki/v/Big_Data:_Hadoop,_Business_Analytics_and_Beyond#The_Big_Data_Skills_Gap
http://www.ibmbigdatahub.com/blog/addressing-big-data-skills-gap
20. Two orthogonal aspects
• Analytics / machine learning
– learning insights from data
• Big data
– handling massive data volumes
• Can be combined, or used separately
22. How to process Big Data?
• If relational databases are not enough,
what is?
“Mining of Big Data is problem solve in 2013 with zgrep”
https://twitter.com/devops_borat
23. MapReduce
• A framework for writing massively parallel
code
• Simple, straightforward model
• Based on “map” and “reduce” functions
from functional programming (LISP)
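To make the model concrete, here is a minimal single-process sketch of the map/reduce idea, using word counting as the example. The function names (map_words, reduce_counts, run_mapreduce) are made up for this illustration; a real framework would spread the map and reduce calls over many machines and handle the grouping step itself.

```python
from collections import defaultdict

def map_words(line):
    # "map" step: emit a (key, value) pair for every word in one input line
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    # "reduce" step: combine all values emitted for one key
    return (word, sum(counts))

def run_mapreduce(lines):
    # the framework's job: group the mapped pairs by key, then reduce each group
    groups = defaultdict(list)
    for line in lines:
        for (key, value) in map_words(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for (key, values) in groups.items())

counts = run_mapreduce(["big data is big", "data is data"])
```

Because each map call sees only one line and each reduce call sees only one key, both steps can run in parallel without sharing state.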
24. NoSQL and Big Data
• Not really that relevant
• Traditional databases handle big data sets,
too
• NoSQL databases have poor analytics
• MapReduce often works from text files
– can obviously work from SQL and NoSQL, too
• NoSQL is more for high throughput
– basically, AP from the CAP theorem, instead of CP
• In practice, really Big Data is likely to be a
mix
– text files, NoSQL, and SQL
25. The 4th V: Veracity
“The greatest enemy of knowledge is not
ignorance, it is the illusion of knowledge.”
Daniel Boorstin, in The Discoverers (1983)
“95% of time, when is clean Big Data is get Little Data”
https://twitter.com/devops_borat
26. Data quality
• A huge problem in practice
– any manually entered data is suspect
– most data sets are in practice deeply problematic
• Even automatically gathered data can be a
problem
– systematic problems with sensors
– errors causing data loss
– incorrect metadata about the sensor
• Never, never, never trust the data without
checking it!
– garbage in, garbage out, etc
28. Conclusion
• Vast potential
– to both big data and machine learning
• Very difficult to realize that potential
– requires mathematics, which nobody knows
• We need to wake up!
30. Two kinds of learning
• Supervised
– we have training data with correct answers
– use training data to prepare the algorithm
– then apply it to data without a correct answer
• Unsupervised
– no training data
– throw data into the algorithm, hope it makes some
kind of sense out of the data
31. Some types of algorithms
• Prediction
– predicting a variable from data
• Classification
– assigning records to predefined groups
• Clustering
– splitting records into groups based on similarity
• Association learning
– seeing what often appears together with what
32. Issues
• Data is usually noisy in some way
– imprecise input values
– hidden/latent input values
• Inductive bias
– basically, the shape of the algorithm we choose
– may not fit the data at all
– may induce underfitting or overfitting
• Machine learning without inductive bias is
not possible
34. Overfitting
• Tuning the algorithm so carefully it starts
matching the noise in the training data
35.
“What if the knowledge and data we have are not sufficient to completely determine the correct classifier? Then we run the risk of just hallucinating a classifier (or parts of it) that is not grounded in reality, and is simply encoding random quirks in the data. This problem is called overfitting, and is the bugbear of machine learning. When your learner outputs a classifier that is 100% accurate on the training data but only 50% accurate on test data, when in fact it could have output one that is 75% accurate on both, it has overfit.”
http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
36. Testing
• When doing this for real, testing is crucial
• Testing means splitting your data set
– training data (used as input to algorithm)
– test data (used for evaluation only)
• Need to compute some measure of
performance
– precision/recall
– root mean square error
• A huge field of theory here
– will not go into it in this course
– very important in practice
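A minimal sketch of the two steps above: splitting the data set, and computing precision/recall for a yes/no classifier. The helper names and the 25% test fraction are assumptions made for this illustration, not something the course prescribes.

```python
import random

def split_data(records, test_fraction=0.25, seed=42):
    # shuffle a copy so the split is random but repeatable
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def precision_recall(predicted, actual):
    # predicted and actual are parallel lists of booleans
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for (p, a) in pairs if p and a)
    false_pos = sum(1 for (p, a) in pairs if p and not a)
    false_neg = sum(1 for (p, a) in pairs if not p and a)
    # precision: how many of the "yes" answers were right
    precision = true_pos / float(true_pos + false_pos) if true_pos + false_pos else 0.0
    # recall: how many of the real "yes" cases were found
    recall = true_pos / float(true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall

train, test = split_data(range(100))
```

The important part is that the test records are never shown to the algorithm during training; otherwise the measured performance says nothing about overfitting.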
37. Missing values
• Usually, there are missing values in the
data set
– that is, some records have some NULL values
• These cause problems for many machine
learning algorithms
• Need to solve somehow
– remove all records with NULLs
– use a default value
– estimate a replacement value
– ...
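The three strategies listed above can be sketched as follows, with records as lists of numbers and None standing in for NULL. The function names are hypothetical, and using the column mean is just one simple way to estimate a replacement value.

```python
def drop_incomplete(records):
    # strategy 1: remove all records that contain a None
    return [r for r in records if None not in r]

def fill_default(records, default=0.0):
    # strategy 2: replace every None with a fixed default value
    return [[default if v is None else v for v in r] for r in records]

def fill_column_mean(records):
    # strategy 3: estimate a replacement, here the mean of the known
    # values in the same column
    means = []
    for column in zip(*records):
        known = [v for v in column if v is not None]
        means.append(sum(known) / float(len(known)) if known else 0.0)
    return [[means[i] if v is None else v for (i, v) in enumerate(r)]
            for r in records]

data = [[1.0, 2.0], [None, 4.0], [3.0, None]]
```

Dropping records is the safest choice when there is plenty of data; the other two keep more records but inject values that were never observed.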
38. Terminology
• Vector
– one-dimensional array
• Matrix
– two-dimensional array
• Linear algebra
– algebra with vectors and matrices
– addition, multiplication, transposition, ...
40. Top 10 machine learning algs
1. C4.5 No
2. k-means clustering Yes
3. Support vector machines No
4. the Apriori algorithm No
5. the EM algorithm No
6. PageRank No
7. AdaBoost No
8. k-nearest neighbours class. Kind of
9. Naïve Bayes Yes
10. CART No
From a survey at IEEE International Conference on Data Mining (ICDM) in December 2006. “Top 10
algorithms in data mining”, by X. Wu et al
41. C4.5
• Algorithm for building decision trees
– basically trees of boolean expressions
– each node splits the data set in two
– leaves assign items to classes
• Decision trees are useful not just for
classification
– they can also teach you something about the
classes
• C4.5 is a bit involved to learn
– the ID3 algorithm is much simpler
• CART (#10) is another algorithm for
learning decision trees
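The core question when building a decision tree is which boolean test to put in a node. ID3 answers it with information gain: pick the test that reduces the entropy of the class labels the most. The sketch below shows only that measure, with hypothetical function names; C4.5 refines it with a gain ratio, which is left out here.

```python
from math import log
from collections import Counter

def entropy(labels):
    # Shannon entropy (in bits) of a list of class labels
    total = float(len(labels))
    return -sum((n / total) * log(n / total, 2)
                for n in Counter(labels).values())

def information_gain(rows, labels, test):
    # how much a boolean test on the rows reduces the entropy of the labels
    yes = [l for (r, l) in zip(rows, labels) if test(r)]
    no = [l for (r, l) in zip(rows, labels) if not test(r)]
    if not yes or not no:
        return 0.0   # the test does not split the data at all
    total = float(len(labels))
    remainder = ((len(yes) / total) * entropy(yes) +
                 (len(no) / total) * entropy(no))
    return entropy(labels) - remainder
```

A tree builder would call information_gain for every candidate test, split on the best one, and repeat on each half until the leaves are pure enough.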
42. Support Vector Machines
• A way to do binary classification on
matrices
• Support vectors are the data points nearest
to the hyperplane that divides the classes
• SVMs maximize the distance between SVs
and the boundary
• Particularly valuable because of “the kernel
trick”
– using a transformation to a higher dimension to
handle more complex class boundaries
• A bit of work to learn, but manageable
43. Apriori
• An algorithm for “frequent itemsets”
– basically, working out which items frequently
appear together
– for example, what goods are often bought
together in the supermarket?
– used for Amazon’s “customers who bought this...”
• Can also be used to find association rules
– that is, “people who buy X often buy Y” or similar
• Apriori is slow
– a faster, further development is FP-growth
http://www.dssresources.com/newsletters/66.php
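A simplified sketch of the Apriori idea, limited to single items and pairs. The key trick is the pruning step: a pair can only be frequent if both of its items are frequent, so the second pass ignores everything else. The function name, the sample baskets and the support threshold are all invented for this illustration.

```python
from itertools import combinations
from collections import Counter

def frequent_itemsets(baskets, min_support):
    # pass 1: count single items, keep those in at least min_support baskets
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent_items = set(i for (i, n) in item_counts.items() if n >= min_support)

    # pass 2: count only pairs whose members are both frequent (pruning)
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent_items)
        pair_counts.update(combinations(kept, 2))
    frequent_pairs = dict((p, n) for (p, n) in pair_counts.items()
                          if n >= min_support)
    return frequent_items, frequent_pairs

baskets = [["beer", "chips", "salsa"], ["beer", "chips"],
           ["beer", "diapers"], ["chips", "salsa"]]
items, pairs = frequent_itemsets(baskets, 2)
```

The full algorithm keeps going in the same way, building candidate triples from frequent pairs and so on, which is where the repeated passes over the data make it slow.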
44. Expectation Maximization
• A deeply interesting algorithm I’ve seen
used in a number of contexts
– very hard to understand what it does
– very heavy on the maths
• Essentially an iterative algorithm
– alternates between an “expectation” step and a
“maximization” step
– tries to find the parameters that maximize a likelihood function
• Can be used for
– clustering
– a number of more specialized examples, too
45. PageRank
• Basically a graph analysis algorithm
– identifies the most prominent nodes
– used for weighting search results on Google
• Can be applied to any graph
– for example an RDF data set
• Basically works by simulating random walk
– estimating the likelihood that a walker would be
on a given node at a given time
– actual implementation is linear algebra
• The basic algorithm has some issues
– “spider traps”
– graph must be connected
– straightforward solutions to these exist
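A small sketch of the random-walk idea, written as plain iteration rather than linear algebra. It uses the usual fix for spider traps and disconnected graphs: a damping factor, meaning that with some probability the walker jumps to a random node instead of following a link. The function name, the 0.85 damping value and the tiny example graph are assumptions for illustration; the sketch also assumes every link target appears as a key in the dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    # links maps each node to the list of nodes it links to
    nodes = list(links)
    n = float(len(nodes))
    rank = dict((node, 1.0 / n) for node in nodes)
    for _ in range(iterations):
        # every node gets a small share from the random jumps
        new_rank = dict((node, (1.0 - damping) / n) for node in nodes)
        for node in nodes:
            # a node without outgoing links spreads its rank over all nodes
            targets = links[node] or nodes
            share = damping * rank[node] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank

ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]})
```

Each pass moves the ranks one step closer to the probability of finding the walker on each node; in this graph "c" ends up on top because three of the four nodes link to it.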
46. AdaBoost
• Algorithm for “ensemble learning”
• That is, for combining several algorithms
– and training them on the same data
• Combining more algorithms can be very
effective
– usually better than a single algorithm
• AdaBoost basically weights training
samples
– giving the most weight to those which are
classified the worst
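The reweighting step can be sketched as below, for the two-class case. After each weak classifier is trained, the samples it got wrong gain weight and the ones it got right lose weight, so the next classifier concentrates on the hard cases. The function name is hypothetical, and the sketch assumes the error is strictly between 0 and 1.

```python
from math import exp, log

def reweight(weights, correct):
    # weights: current sample weights, summing to 1
    # correct: whether the latest weak classifier got each sample right
    error = sum(w for (w, ok) in zip(weights, correct) if not ok)
    # the classifier's say in the final vote: larger when the error is smaller
    alpha = 0.5 * log((1.0 - error) / error)
    updated = [w * exp(-alpha if ok else alpha)
               for (w, ok) in zip(weights, correct)]
    total = sum(updated)
    return alpha, [w / total for w in updated]

alpha, weights = reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
```

After the update the misclassified samples together carry half of the total weight, which is what forces the next classifier to do something different.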
48. Collaborative filtering
• Basically, you’ve got some set of items
– these can be movies, books, beers, whatever
• You’ve also got ratings from users
– on a scale of 1-5, 1-10, whatever
• Can you use this to recommend items to a
user, based on their ratings?
– if you use the connection between their ratings and
other people’s ratings, it’s called collaborative
filtering
– other approaches are possible
49. Feature-based recommendation
• Use user’s ratings of items
– run an algorithm to learn what features of items
the user likes
• Can be difficult to apply because
– requires detailed information about items
– key features may not be present in data
• Recommending music may be difficult, for
example
50. A simple idea
• If we can find ratings from people similar to
you, we can see what they liked
– the assumption is that you should also like it, since
your other ratings agreed so well
• You can take the average ratings of the k
people most similar to you
– then display the items with the highest averages
• This approach is called k-nearest neighbours
– it’s simple, computationally inexpensive, and works
pretty well
– there are, however, some tricks involved
51. MovieLens data
• Three sets of movie rating data
– real, anonymized data, from the MovieLens site
– ratings on a 1-5 scale
• Increasing sizes
– 100,000 ratings
– 1,000,000 ratings
– 10,000,000 ratings
• Includes a bit of information about the movies
• The two smallest data sets also contain
demographic information about users
http://www.grouplens.org/node/73
52. Basic algorithm
• Load data into rating sets
– a rating set is a list of (movie id, rating) tuples
– one rating set per user
• Compare rating sets against the user’s
rating set with a similarity function
– pick the k most similar rating sets
• Compute average movie rating within
these k rating sets
• Show movies with highest averages
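The steps above can be sketched in a few lines. This is an illustration under assumptions, not the code from the repository: rating sets are dicts of movie id to rating, mean_abs_diff is a stand-in similarity function, and movies the user has already rated are left out of the result.

```python
def mean_abs_diff(r1, r2):
    # stand-in distance: average absolute difference over common movies
    common = [m for m in r1 if m in r2]
    if not common:
        return 1000000.0   # nothing in common, so treat as very far apart
    return sum(abs(r1[m] - r2[m]) for m in common) / float(len(common))

def recommend(user, others, k, distance=mean_abs_diff):
    # pick the k rating sets closest to the user's own
    nearest = sorted(others, key=lambda ratings: distance(user, ratings))[:k]
    # collect the neighbours' ratings for movies the user has not rated
    collected = {}
    for ratings in nearest:
        for (movie, rating) in ratings.items():
            if movie not in user:
                collected.setdefault(movie, []).append(rating)
    # average per movie, highest average first
    averages = [(sum(rs) / float(len(rs)), movie)
                for (movie, rs) in collected.items()]
    return sorted(averages, reverse=True)

me = {1: 5, 2: 1}
others = [{1: 5, 2: 1, 3: 4}, {1: 5, 2: 2, 3: 5, 4: 2}, {1: 1, 2: 5, 5: 5}]
top = recommend(me, others, 2)
```

Here the third user disagrees with every rating and is never picked as a neighbour, so movie 5 does not show up in the recommendations.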
53. Similarity functions
• Minkowski distance
– basically geometric distance, generalized to any
number of dimensions
• Pearson correlation coefficient
• Vector cosine
– measures angle between vectors
• Root mean square error (RMSE)
– square root of the mean of square differences
between data values
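For reference, the first three measures can be written as short functions over equal-length lists of numbers. This is a sketch with names chosen for the illustration; note that Minkowski is a distance (smaller means more similar) while cosine and Pearson are similarities (larger means more similar), and that the latter two fail on all-zero or constant vectors.

```python
from math import sqrt

def minkowski(v1, v2, p=2):
    # p=1 gives Manhattan distance, p=2 ordinary Euclidean distance
    return sum(abs(a - b) ** p for (a, b) in zip(v1, v2)) ** (1.0 / p)

def cosine(v1, v2):
    # cosine of the angle between the vectors: 1.0 means same direction
    dot = sum(a * b for (a, b) in zip(v1, v2))
    return dot / (sqrt(sum(a * a for a in v1)) * sqrt(sum(b * b for b in v2)))

def pearson(v1, v2):
    # correlation coefficient: the cosine after subtracting each vector's mean
    m1 = sum(v1) / float(len(v1))
    m2 = sum(v2) / float(len(v2))
    return cosine([a - m1 for a in v1], [b - m2 for b in v2])
```

Pearson is insensitive to users who rate everything one point higher than others, which is one reason it is popular for rating data.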
54. Data I added
User ID  Movie ID  Rating  Title
6041 347 4 Bitter Moon
6041 1680 3 Sliding Doors
6041 229 5 Death and the Maiden
6041 1732 3 The Big Lebowski
6041 597 2 Pretty Woman
6041 991 4 Michael Collins
6041 1693 3 Amistad
6041 1484 4 The Daytrippers
6041 427 1 Boxing Helena
6041 509 4 The Piano
6041 778 5 Trainspotting
6041 1204 4 Lawrence of Arabia
6041 1263 5 The Deer Hunter
6041 1183 5 The English Patient
6041 1343 1 Cape Fear
6041 260 1 Star Wars
6041 405 1 Highlander III
6041 745 5 A Close Shave
6041 1148 5 The Wrong Trousers
6041 1721 1 Titanic
This is the 1M data set
https://github.com/larsga/py-snippets/tree/master/machine-learning/movielens
Note A Close Shave and The Wrong Trousers. Later we’ll see Wallace &
Gromit popping up in recommendations.
55. Root Mean Square Error
• This is a measure that’s often used to judge
the quality of prediction
– predicted value: x
– actual value: y
• For each pair of values, do
– $(y - x)^2$
• Procedure
– sum over all pairs,
– divide by the number of values (to get average),
– take the square root of that (to undo squaring)
• We use the square because
– that always gives us a positive number,
– it emphasizes bigger deviations
56. RMSE in Python
from math import sqrt

def rmse(rating1, rating2):
    # root mean square error over the movies rated in both rating sets
    total = 0
    count = 0
    for (key, rating) in rating1.items():
        if key in rating2:
            total += (rating2[key] - rating) ** 2
            count += 1
    if not count:
        return 1000000 # no common ratings, so distance is huge
    return sqrt(total / float(count))
57. Output, k=3
===== User 0 ==================================================
User # 14 , distance: 0.0
Deer Hunter, The (1978) 5 YOUR: 5
===== User 1 ==================================================
User # 68 , distance: 0.0
Close Shave, A (1995) 5 YOUR: 5
===== User 2 ==================================================
User # 95 , distance: 0.0
Big Lebowski, The (1998) 3 YOUR: 3
===== RECOMMENDATIONS =============================================
Chicken Run (2000) 5.0
Auntie Mame (1958) 5.0
Muppet Movie, The (1979) 5.0
'Night Mother (1986) 5.0
Goldfinger (1964) 5.0
Children of Paradise (Les enfants du paradis) (1945) 5.0
Total Recall (1990) 5.0
Boys Don't Cry (1999) 5.0
Radio Days (1987) 5.0
Ideal Husband, An (1999) 5.0
Red Violin, The (Le Violon rouge) (1998) 5.0
Distance measure: RMSE
Obvious problem: ratings agree perfectly,
but there are too few common ratings. More
ratings mean greater chance of disagreement.
58. RMSE 2.0
from math import sqrt

def lmg_rmse(rating1, rating2):
    # RMSE plus a penalty that shrinks as the number of common ratings grows
    max_rating = 5.0
    total = 0
    count = 0
    for (key, rating) in rating1.items():
        if key in rating2:
            total += (rating2[key] - rating) ** 2
            count += 1
    if not count:
        return 1000000 # no common ratings, so distance is huge
    return sqrt(total / float(count)) + (max_rating / count)
59. Output, k=3, RMSE 2.0
===== 0 ==================================================
User # 3320 , distance: 1.09225018729
Highlander III: The Sorcerer (1994) 1 YOUR: 1
Boxing Helena (1993) 1 YOUR: 1
Pretty Woman (1990) 2 YOUR: 2
Close Shave, A (1995) 5 YOUR: 5
Michael Collins (1996) 4 YOUR: 4
Wrong Trousers, The (1993) 5 YOUR: 5
Amistad (1997) 4 YOUR: 3
===== 1 ==================================================
User # 2825 , distance: 1.24880819811
Amistad (1997) 3 YOUR: 3
English Patient, The (1996) 4 YOUR: 5
Wrong Trousers, The (1993) 5 YOUR: 5
Death and the Maiden (1994) 5 YOUR: 5
Lawrence of Arabia (1962) 4 YOUR: 4
Close Shave, A (1995) 5 YOUR: 5
Piano, The (1993) 5 YOUR: 4
===== 2 ==================================================
User # 1205 , distance: 1.41068360252
Sliding Doors (1998) 4 YOUR: 3
English Patient, The (1996) 4 YOUR: 5
Michael Collins (1996) 4 YOUR: 4
Close Shave, A (1995) 5 YOUR: 5
Wrong Trousers, The (1993) 5 YOUR: 5
Piano, The (1993) 4 YOUR: 4
===== RECOMMENDATIONS ==================================================
Patriot, The (2000) 5.0
Badlands (1973) 5.0
Blood Simple (1984) 5.0
Gold Rush, The (1925) 5.0
Mission: Impossible 2 (2000) 5.0
Gladiator (2000) 5.0
Hook (1991) 5.0
Funny Bones (1995) 5.0
Creature Comforts (1990) 5.0
Do the Right Thing (1989) 5.0
Thelma & Louise (1991) 5.0
Much better choice of users
But all recommended movies are 5.0
Basically, if one user gave it 5.0, that’s
going to beat 5.0, 5.0, and 4.0
Clearly, we need to reward movies that
have more ratings somehow
60. Bayesian average
• A simple weighted average that accounts
for how many ratings there are
• Basically, you take the set of ratings and
add n extra “fake” ratings of the average
value
• So for movies, we use the average of 3.0
def avg(numbers, n):
    return (sum(numbers) + (3.0 * n)) / float(len(numbers) + n)
>>> avg([5.0], 2)
3.6666666666666665
>>> avg([5.0, 5.0], 2)
4.0
>>> avg([5.0, 5.0, 5.0], 2)
4.2
>>> avg([5.0, 5.0, 5.0, 5.0], 2)
4.333333333333333
61. With k=3
===== RECOMMENDATIONS ===============
Truman Show, The (1998) 4.2
Say Anything... (1989) 4.0
Jerry Maguire (1996) 4.0
Groundhog Day (1993) 4.0
Monty Python and the Holy Grail (1974) 4.0
Big Night (1996) 4.0
Babe (1995) 4.0
What About Bob? (1991) 3.75
Howards End (1992) 3.75
Winslow Boy, The (1998) 3.75
Shakespeare in Love (1998) 3.75
Not very good, but k=3 makes us
very dependent on those specific 3
users.
62. With k=10
===== RECOMMENDATIONS ===============
Groundhog Day (1993) 4.55555555556
Annie Hall (1977) 4.4
One Flew Over the Cuckoo's Nest (1975) 4.375
Fargo (1996) 4.36363636364
Wallace & Gromit: The Best of Aardman Animation (1996) 4.33333333333
Do the Right Thing (1989) 4.28571428571
Princess Bride, The (1987) 4.28571428571
Welcome to the Dollhouse (1995) 4.28571428571
Wizard of Oz, The (1939) 4.25
Blood Simple (1984) 4.22222222222
Rushmore (1998) 4.2
Definitely better.
63. With k=50
===== RECOMMENDATIONS ===============
Wallace & Gromit: The Best of Aardman Animation (1996) 4.55
Roger & Me (1989) 4.5
Waiting for Guffman (1996) 4.5
Grand Day Out, A (1992) 4.5
Creature Comforts (1990) 4.46666666667
Fargo (1996) 4.46511627907
Godfather, The (1972) 4.45161290323
Raising Arizona (1987) 4.4347826087
City Lights (1931) 4.42857142857
Usual Suspects, The (1995) 4.41666666667
Manchurian Candidate, The (1962) 4.41176470588
64. With k = 2,000,000
• If we did that, what results would we get?
65. Normalization
• People use the scale differently
– some give only 4s and 5s
– others give only 1s
– some give only 1s and 5s
– etc
• Should have normalized user ratings before
using them
– before comparison
– and before averaging ratings from neighbours
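One possible way to normalize, shown as a sketch rather than as what the course code does: subtract the user's own mean and divide by the range of ratings that user actually uses. The function name and the choice of scaling are assumptions; other schemes (such as dividing by the standard deviation) would work too.

```python
def normalize(ratings):
    # ratings: dict of movie id -> rating for one user
    values = list(ratings.values())
    mean = sum(values) / float(len(values))
    spread = max(values) - min(values)
    if spread == 0:
        # the user gave every movie the same rating, so there is no
        # preference information to keep
        return dict((movie, 0.0) for movie in ratings)
    return dict((movie, (r - mean) / spread)
                for (movie, r) in ratings.items())
```

With this scheme a user who only gives 4s and 5s and a user who only gives 1s and 5s end up on the same scale: the movie they liked less gets -0.5 and the one they liked more gets 0.5.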
67. Bayes’s Theorem
• Basically a theorem for combining
probabilities
– I’ve observed A, which indicates H is true with
probability 70%
– I’ve also observed B, which indicates H is true with
probability 85%
– what should I conclude?
• Naïve Bayes is basically using this theorem
– with the assumption that A and B are independent
– this assumption is nearly always false, hence
“naïve”
68. Simple example
• Is the coin fair or not?
– we throw it 10 times, get 9 heads and one tail
– we try again, get 8 heads and two tails
• What do we know now?
– can combine data and recompute
– or just use Bayes’s Theorem directly
http://www.bbc.co.uk/news/magazine-22310186
>>> compute_bayes([0.92, 0.84])
0.9837067209775967
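The body of compute_bayes is not shown on the slide. A sketch that is consistent with the output above is the usual naïve combination rule: multiply the probabilities, multiply their complements, and divide the first product by the sum of both. Treat this as an assumed reconstruction, not necessarily the exact code in the repository.

```python
def compute_bayes(probs):
    # combine independent probability estimates for the same hypothesis:
    #   p1*p2*...  /  (p1*p2*...  +  (1-p1)*(1-p2)*...)
    product = 1.0
    inverse = 1.0
    for p in probs:
        product *= p
        inverse *= (1.0 - p)
    return product / (product + inverse)
```

An estimate of 0.5 carries no information and leaves the result unchanged, while two estimates that point the same way reinforce each other, which is why 0.92 and 0.84 combine into something higher than either.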
69. Ways I’ve used Bayes
• Duke
– record deduplication engine
– estimate probability of duplicate for each property
– combine probabilities with Bayes
• Whazzup
– news aggregator that finds relevant news
– works essentially like spam classifier on next slide
• Tine recommendation prototype
– recommends recipes based on previous choices
– also like spam classifier
• Classifying expenses
– using export from my bank
– also like spam classifier
70. Bayes against spam
• Take a set of emails, divide it into spam and
non-spam (ham)
– count the number of times a feature appears in
each of the two sets
– a feature can be a word or anything you please
• To classify an email, for each feature in it
– consider the probability of email being spam given
that feature to be (spam count) / (spam count +
ham count)
– ie: if “viagra” appears 99 times in spam and 1 in
ham, the probability is 0.99
• Then combine the probabilities with Bayes
http://www.paulgraham.com/spam.html
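The counting step described above can be sketched like this. The names (train, spam_probability) and the neutral 0.5 for unseen features are assumptions for the illustration; the actual script linked from the code slide may organise it differently.

```python
from collections import Counter

spam_counts = Counter()
ham_counts = Counter()

def train(tokens, is_spam):
    # count how often each feature appears in spam and in ham
    (spam_counts if is_spam else ham_counts).update(tokens)

def spam_probability(token):
    # (spam count) / (spam count + ham count)
    spam = spam_counts[token]
    ham = ham_counts[token]
    if spam + ham == 0:
        return 0.5   # never seen before, so it says nothing either way
    return spam / float(spam + ham)

# the example from the slide: 99 appearances in spam, 1 in ham
train(["viagra"] * 99, True)
train(["viagra"], False)
```

To classify an email, compute spam_probability for each of its features and combine the resulting list of probabilities. Features seen in only one of the two sets give exactly 0.0 or 1.0, which would dominate the combination, so in practice the values are usually clamped to something like 0.01 and 0.99.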
71. Running the script
• I pass it
– 1000 emails from my Bouvet folder
– 1000 emails from my Spam folder
• Then I feed it
– 1 email from another Bouvet folder
– 1 email from another Spam folder
72. Code
import glob
import sys

# spamdir, hamdir, PATTERN, SAMPLES, corpus, featurize and classify
# are defined elsewhere in the script (see the link below)

# scan spam
for spam in glob.glob(spamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(spam):
        corpus.spam(token)

# scan ham
for ham in glob.glob(hamdir + '/' + PATTERN)[ : SAMPLES]:
    for token in featurize(ham):
        corpus.ham(token)

# compute probability
for email in sys.argv[3 : ]:
    print(email)
    p = classify(email)
    if p < 0.2:
        print(' Spam', p)
    else:
        print(' Ham', p)
https://github.com/larsga/py-snippets/tree/master/machine-learning/spam
76. More solid testing
• Using the SpamAssassin public corpus
• Training with 500 emails from
– spam
– easy_ham (2002)
• Test results
– spam_2: 1128 classified as spam, 269 misclassified as ham
– easy_ham 2003: 2283 classified as ham, 217 misclassified as spam
• Results are pretty good for 30 minutes of
effort...
http://spamassassin.apache.org/publiccorpus/
78. Linear regression
• Let’s say we have a number of numerical
parameters for an object
• We want to use these to predict some
other value
• Examples
– estimating real estate prices
– predicting the rating of a beer
– ...
79. Estimating real estate prices
• Take parameters
– x1 square meters
– x2 number of rooms
– x3 number of floors
– x4 energy cost per year
– x5 meters to nearest subway station
– x6 years since built
– x7 years since last refurbished
– ...
• a x1 + b x2 + c x3 + ... = price
– strip out the x-es and you have a vector
– collect N samples of real flats with prices = matrix
– welcome to the world of linear algebra
80. Our data set: beer ratings
• Ratebeer.com
– a web site for rating beer
– scale of 0.5 to 5.0
• For each beer we know
– alcohol %
– country of origin
– brewery
– beer style (IPA, pilsener, stout, ...)
• But ... only one attribute is numeric!
– how to solve?
81. Example
ABV   .se   .nl   .us   .uk   IIPA   Black IPA   Pale ale   Bitter   Rating
8.5   1.0   0.0   0.0   0.0   1.0    0.0         0.0        0.0      3.5
8.0   0.0   1.0   0.0   0.0   0.0    1.0         0.0        0.0      3.7
6.2   0.0   0.0   1.0   0.0   0.0    0.0         1.0        0.0      3.2
4.4   0.0   0.0   0.0   1.0   0.0    0.0         0.0        1.0      3.2
...   ...   ...   ...   ...   ...    ...         ...        ...      ...
Basically, we turn each category into a column of 0.0 or 1.0 values.
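As a rough sketch of that conversion (often called one-hot encoding), the snippet below builds one row of the table. The category lists and the `encode_beer` helper are made up for the example and only cover the columns shown above.

```python
COUNTRIES = [".se", ".nl", ".us", ".uk"]
STYLES = ["IIPA", "Black IPA", "Pale ale", "Bitter"]

def one_hot(value, categories):
    """Return one 0.0/1.0 column per category, with 1.0 marking the match."""
    return [1.0 if value == category else 0.0 for category in categories]

def encode_beer(abv, country, style):
    """Turn (abv, country, style) into a purely numeric row."""
    return [abv] + one_hot(country, COUNTRIES) + one_hot(style, STYLES)

row = encode_beer(8.5, ".se", "IIPA")
print(row)
```

The printed row matches the first line of the table, minus the rating column.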
82. Normalization
• If some columns have much bigger values than
the others they will automatically dominate
predictions
• We solve this by normalization
• Basically, all values get resized into the 0.0-1.0
range
• For ABV we set a ceiling of 15%
– compute with min(15.0, abv) / 15.0
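A small sketch of both ideas: the capped ABV formula from the slide, plus a generic min-max rescaling for columns without a natural ceiling. The `min_max` helper is an assumption for illustration; the talk only shows the ABV formula.

```python
def normalize_abv(abv, ceiling=15.0):
    """Cap ABV at the ceiling, then scale into the 0.0-1.0 range."""
    return min(ceiling, abv) / ceiling

def min_max(values):
    """Rescale a whole column so its smallest value is 0.0 and largest 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - lowest) / (highest - lowest) for v in values]
```

For example, `normalize_abv(7.5)` gives 0.5, and anything at or above 15% maps to 1.0.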
83. Adding more data
• To get a bit more data, I manually added a
description of each beer style
• Each beer style got a 0.0-1.0 rating on
– colour (pale/dark)
– sweetness
– hoppiness
– sourness
• These ratings are kind of coarse because all
beers of the same style get the same value
84. Making predictions
• We’re looking for a formula
– a * abv + b * .se + c * .nl + d * .us + ... = rating
• We have n examples
– a * 8.5 + b * 1.0 + c * 0.0 + d * 0.0 + ... = 3.5
• We have one unknown per column
– as long as we have more rows than columns (and the columns are
linearly independent) we can find the best-fitting solution
• Interestingly, matrix operations can be used to
solve this easily
85. Matrix formulation
• Let’s say
– x is our data matrix
– y is a vector with the ratings and
– w is a vector with the a, b, c, ... values
• That is: x * w = y
– this is the same as the original equation
– a x1 + b x2 + c x3 + ... = rating
• If we solve this for w (in the least-squares sense), we get
– w = (x^T x)^-1 x^T y
86. Enter Numpy
• Numpy is a Python library for matrix
operations
• It has built-in types for vectors and matrices
• Means you can very easily work with matrices
in Python
• Why matrices?
– much easier to express what we want to do
– library written in C and very fast
– takes care of rounding errors, etc
88. Numpy solution
• We load the data into
– a list: scores
– a list of lists: parameters
• Then:
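from numpy import mat, linalg

x_mat = mat(parameters)           # one row per beer, one column per parameter
y_mat = mat(scores).T             # ratings as a column vector
x_tx = x_mat.T * x_mat
assert linalg.det(x_tx) != 0.0    # x_tx must be invertible
ws = x_tx.I * (x_mat.T * y_mat)   # least-squares weights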
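For comparison, here is a hedged sketch of the same fit using `numpy.linalg.lstsq`, which solves the least-squares problem without explicitly inverting anything and therefore also copes with linearly dependent columns. The `fit_weights` and `predict` names and the toy numbers are invented for the example; this is not the code used in the talk.

```python
import numpy as np

def fit_weights(parameters, scores):
    """Find w minimizing ||x w - y||, given a list of rows and a list of scores."""
    x = np.asarray(parameters, dtype=float)
    y = np.asarray(scores, dtype=float)
    w, _residuals, _rank, _singular_values = np.linalg.lstsq(x, y, rcond=None)
    return w

def predict(w, row):
    """Apply the fitted weights to one row of parameters."""
    return float(np.dot(row, w))

# Toy data (made up): the score is exactly 2 * first column + 3 * second column.
toy_parameters = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_scores = [2.0, 3.0, 5.0]
w = fit_weights(toy_parameters, toy_scores)
print(w, predict(w, [2.0, 1.0]))
```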
89. Does it work?
• We only have very rough information about
each beer (abv, country, style)
– so very detailed prediction isn’t possible
– but we should get some indication
• Here are the results based on my ratings
– 10% imperial stout from US 3.9
– 4.5% pale lager from Ukraine 2.8
– 5.2% German schwarzbier 3.1
– 7.0% German doppelbock 3.5
http://www.ratebeer.com/user/15206/ratings/
90. Beyond prediction
• We can use this for more than just prediction
• We can also use it to see which columns
contribute the most to the rating
– that is, which aspects of a beer best predict the rating
• If we look at the w vector we see the following
– Aspect      LMG    grove
– ABV         0.56   1.1
– colour      0.46   0.42
– sweetness   0.25   0.51
– hoppiness   0.45   0.41
– sourness    0.29   0.87
• Could also use correlation
91. Did we underfit?
• Who says the relationship between ABV
and the rating is linear?
– perhaps very low and very high ABV are both
negative?
– we cannot capture that with linear regression
• Solution
– add computed columns for parameters raised to
higher powers
– abv^2, abv^3, abv^4, ...
– beware of overfitting...
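A minimal sketch of the computed-columns idea, assuming a single normalized ABV column and made-up ratings that peak at a medium ABV. The `add_powers` helper and the constant (intercept) column are assumptions for the example, not part of the talk's code.

```python
import numpy as np

def add_powers(column, max_power):
    """Return columns [v, v^2, ..., v^max_power] for each value v."""
    column = np.asarray(column, dtype=float)
    return np.column_stack([column ** p for p in range(1, max_power + 1)])

# Made-up data: ratings fall off on both sides of abv = 0.5.
abv = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
rating = 3.5 - 4.0 * (abv - 0.5) ** 2

# Linear regression on [1, abv, abv^2] can capture the curve;
# regression on [1, abv] alone cannot.
x = np.column_stack([np.ones_like(abv), add_powers(abv, 2)])
w, _, _, _ = np.linalg.lstsq(x, rating, rcond=None)
predicted = x @ w
print(w, predicted)
```

Each extra power adds another unknown, so with few rows the fit starts to follow noise, which is the overfitting risk mentioned above.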
94. Matrix factorization
• Another way to do recommendations is
matrix factorization
– basically, make a user/item matrix with ratings
– try to find two smaller matrices that, when
multiplied together, give you the original matrix
– that is, original with missing values filled in
• Why does that work?
– I don’t know
– I tried it, couldn’t get it to work
– therefore we’re not covering it
– known to be a very good method, however
96. Clustering
• Basically, take a set of objects and sort
them into groups
– objects that are similar go into the same group
• The groups are not defined beforehand
• Sometimes the number of groups to create
is input to the algorithm
• Many, many different algorithms for this
97. Sample data
• Our sample data set is data about aircraft from
DBpedia
• For each aircraft model we have
– name
– length (m)
– height (m)
– wingspan (m)
– number of crew members
– operational ceiling, or max height (m)
– max speed (km/h)
– empty weight (kg)
• We use a subset of the data
– 149 aircraft models which all have values for all of these
properties
• Also, all values normalized to the 0.0-1.0 range
98. Distance
• All clustering algorithms require a distance
function
– that is, a measure of similarity between two objects
• Any kind of distance function can be used
– generally, lower values mean more similar
• Examples of distance functions
– metric distance
– vector cosine
– RMSE
– ...
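The three example measures can be sketched as plain Python functions over equal-length lists of numbers. Reading "metric distance" as Euclidean distance, and turning the vector cosine into a distance via 1 - cosine, are assumptions made for this illustration.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rmse(a, b):
    """Root mean squared difference between corresponding values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def cosine_distance(a, b):
    """1 - cosine of the angle between the vectors (0.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)
```

In all three, lower values mean more similar objects, as the slide requires.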
99. k-means clustering
• Input: the number of clusters to create (k)
• Pick k objects
– these are your initial clusters
• For all objects, find nearest cluster
– assign the object to that cluster
• For each cluster, compute mean of all
properties
– use these mean values to compute distance to
clusters
– the mean is often referred to as a “centroid”
– go back to previous step
• Continue until no objects change cluster
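The steps above can be sketched roughly as follows, using RMSE as the distance. This is an illustrative version, not the code from the talk's repository: the `initial`, `seed`, and `max_iterations` parameters are assumptions, and an empty cluster simply keeps its old centroid.

```python
import math
import random

def rmse(a, b):
    """Root mean squared difference between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def centroid(points):
    """Mean of every property across the given points."""
    return [sum(column) / len(points) for column in zip(*points)]

def kmeans(points, k, initial=None, seed=0, max_iterations=100):
    """Return (cluster index per point, list of centroids)."""
    # Pick k objects as the initial clusters unless the caller supplies them.
    centroids = list(initial) if initial else random.Random(seed).sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Assign every object to its nearest cluster.
        new_assignment = [
            min(range(k), key=lambda c: rmse(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object changed cluster
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = centroid(members)
    return assignment, centroids
```

The result depends on which objects are picked initially; different starting points can end in different clusterings.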
100. First attempt at aircraft
• We leave out name and number built when
doing comparison
• We use RMSE as the distance measure
• We set k = 5
• What happens?
– first iteration: all 149 assigned to a cluster
– second: 11 models change cluster
– third: 7 change
– fourth: 5 change
– fifth: 5 change
– sixth: 2
– seventh: 1
– eighth: 0
101. Cluster 5
cluster5, 4 models
ceiling : 13400.0
maxspeed : 1149.7
crew : 7.5
length : 47.275
height : 11.65
emptyweight : 69357.5
wingspan : 47.18
• The Myasishchev M-50 was a Soviet prototype four-engine supersonic bomber which never attained service
• The Tupolev Tu-16 was a twin-engine jet bomber used by the Soviet Union.
• The Myasishchev M-4 Molot is a four-engined strategic bomber
• The Convair B-36 "Peacemaker" was a strategic bomber built by Convair and operated solely by the United States Air Force (USAF) from 1949 to 1959
3 jet bombers, one propeller bomber. Not too bad.
102. Cluster 4
cluster4, 56 models
ceiling : 5898.2
maxspeed : 259.8
crew : 2.2
length : 10.0
height : 3.3
emptyweight : 2202.5
wingspan : 13.8
• The Avia B.135 was a Czechoslovak cantilever monoplane fighter aircraft
• The North American B-25 Mitchell was an American twin-engined medium bomber
• The Yakovlev UT-1 was a single-seater trainer aircraft
• The Yakovlev UT-2 was a single-seater trainer aircraft
• The Siebel Fh 104 Hallore was a small German twin-engined transport, communications and liaison aircraft
• The Messerschmitt Bf 108 Taifun was a German single-engine sports and touring aircraft
• The Airco DH.2 was a single-seat biplane "pusher" aircraft
Small, slow propeller aircraft. Not too bad.
103. Cluster 3
cluster3, 12 models
ceiling : 16921.1
maxspeed : 2456.9
crew : 2.67
length : 17.2
height : 4.92
emptyweight : 9941
wingspan : 10.1
• The Mikoyan MiG-29 is a fourth-generation jet fighter aircraft
• The Vought F-8 Crusader was a single-engine, supersonic [fighter] aircraft
• The English Electric Lightning is a supersonic jet fighter aircraft of the Cold War era, noted for its great speed.
• The Dassault Mirage 5 is a supersonic attack aircraft
• The Northrop T-38 Talon is a two-seat, twin-engine supersonic jet trainer
• The Mikoyan MiG-35 is a further development of the MiG-29
Small, very fast jet planes. Pretty good.
104. Cluster 2
cluster2, 27 models
ceiling : 6447.5
maxspeed : 435
crew : 5.4
length : 24.4
height : 6.7
emptyweight : 16894
wingspan : 32.8
• The Bartini Beriev VVA-14 (vertical take-off amphibious aircraft)
• The Aviation Traders ATL-98 Carvair was a large piston-engine transport aircraft.
• The Junkers Ju 290 was a long-range transport, maritime patrol aircraft and heavy bomber
• The Fokker 50 is a turboprop-powered airliner
• The PB2Y Coronado was a large flying boat patrol bomber
• The Junkers Ju 89 was a heavy bomber
• The Beriev Be-200 Altair is a multipurpose amphibious aircraft
Biggish, kind of slow planes. Some oddballs in this group.
105. Cluster 1
cluster1, 50 models
ceiling : 11612
maxspeed : 726.4
crew : 1.6
length : 11.9
height : 3.8
emptyweight : 5303
wingspan : 13
• The Adam A700 AdamJet was a proposed six-seat civil utility aircraft
• The Learjet 23 is a ... twin-engine, high-speed business jet
• The Learjet 24 is a ... twin-engine, high-speed business jet
• The Curtiss P-36 Hawk was an American-designed and built fighter aircraft
• The Kawasaki Ki-61 Hien was a Japanese World War II fighter aircraft
• The Grumman F3F was the last American biplane fighter aircraft
• The English Electric Canberra is a first-generation jet-powered light bomber
• The Heinkel He 100 was a German pre-World War II fighter aircraft
Small, fast planes. Mostly good, though the Canberra is a poor fit.
106. Clusters, summarizing
• Cluster 1: small, fast aircraft (750 km/h)
• Cluster 2: big, slow aircraft (450 km/h)
• Cluster 3: small, very fast jets (2500 km/h)
• Cluster 4: small, very slow planes (250 km/h)
• Cluster 5: big, fast jet planes (1150 km/h)
For a first attempt to sort through the data,
this is not bad at all
https://github.com/larsga/py-snippets/tree/master/machine-learning/aircraft
107. Agglomerative clustering
• Put all objects in a pile
• Make a cluster of the two objects closest to
one another
– from here on, treat clusters like objects
• Repeat second step until satisfied
There is code for this, too, in the GitHub sample
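As a rough, independent sketch of the procedure (not the code from the GitHub sample), the version below merges the two closest clusters until a requested number of clusters remains. Using that count as the "until satisfied" test, Euclidean distance, and the average pairwise distance between clusters are all assumptions; other choices are equally valid.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(c1, c2):
    """Average pairwise distance between the members of two clusters."""
    total = sum(euclidean(a, b) for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, target_clusters):
    """Merge the closest pair of clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # every object starts as its own cluster
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # treat the merged pair as one object
        del clusters[j]
    return clusters
```

This naive version recomputes all pairwise distances on every pass, which is fine for a data set of 149 aircraft but slow for large inputs.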
109. PCA
• Basically, using eigenvalue analysis to find
out which variables contain the most
information
– the maths are pretty involved
– and I’ve forgotten how it works
– and I’ve thrown out my linear algebra book
– and ordering a new one from Amazon takes too long
– ...so we’re going to do this intuitively
110. An example data set
• Two variables
• Three classes
• What’s the longest
line we could draw
through the data?
• That line is a vector in two dimensions
• What dimension dominates?
– that’s right: the horizontal
– this implies the horizontal contains most of the
information in the data set
• PCA identifies the most significant
variables
111. Dimensionality reduction
• After PCA we know which dimensions
matter
– based on that information we can decide to throw
out less important dimensions
• Result
– smaller data set
– faster computations
– easier to understand
112. Trying out PCA
• Let’s try it on the Ratebeer data
• We know ABV has the most information
– because it’s the only value specified for each
individual beer
• We also include a new column: alcohol
– this is the amount of alcohol in a pint glass of the
beer, measured in centiliters
– this column basically contains no information at
all; it’s computed from the abv column
113. Complete code
import rblib
from numpy import *

def eigenvalues(data, columns):
    covariance = cov(data - mean(data, axis = 0), rowvar = 0)
    eigvals = linalg.eig(mat(covariance))[0]
    indices = list(argsort(eigvals))
    indices.reverse() # so we get most significant first
    return [(columns[ix], float(eigvals[ix])) for ix in indices]

(scores, parameters, columns) = rblib.load_as_matrix('ratings.txt')
for (col, ev) in eigenvalues(parameters, columns):
    print "%40s %s" % (col, float(ev))
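To see the effect of a derived column without the Ratebeer data, here is a small self-contained sketch. The ABV values are made up, and the alcohol column assumes a 56.8 cl pint; `principal_components` is a hypothetical helper, not part of `rblib`.

```python
import numpy as np

def principal_components(data):
    """Return eigenvalues (largest first) and matching eigenvectors of the covariance."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    # The covariance matrix is symmetric, so eigh applies and returns real values.
    eigvals, eigvecs = np.linalg.eigh(covariance)
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

abv = np.array([4.5, 5.2, 7.0, 8.5, 10.0])   # made-up ABV percentages
alcohol_cl = abv / 100.0 * 56.8              # centiliters of alcohol per pint
data = np.column_stack([abv, alcohol_cl])

eigvals, eigvecs = principal_components(data)
print(eigvals)
```

Because the second column is just a multiple of the first, one eigenvalue comes out as (numerically) zero: the two columns together carry only one dimension of information.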
116. University pre-lecture, 1991
• My first meeting with university was Open
University Day, in 1991
• Professor Bjørn Kirkerud gave the computer
science talk
• His subject
– some day processors will stop becoming faster
– we’re already building machines with many processors
– what we need is a way to parallelize software
– preferably automatically, by feeding in normal source
code and getting it parallelized back
• MapReduce is basically the state of the art on
that today
117. MapReduce
• A framework for writing massively parallel
code
• Simple, straightforward model
• Based on “map” and “reduce” functions
from functional programming (LISP)
120. MapReduce
1. Split data into fragments
2. Create a Map task for each fragment
– the task outputs a set of (key, value) pairs
3. Group the pairs by key
4. Call Reduce once for each key
– all pairs with same key passed in together
– reduce outputs new (key, value) pairs
Tasks get spread out over worker nodes
Master node keeps track of completed/failed tasks
Failed tasks are restarted
Failed nodes are detected and avoided
Also scheduling tricks to deal with slow nodes
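The four steps can be mimicked in a single process to make the data flow concrete. This is only a toy model of the programming model: the `map_reduce` function is invented for illustration, and there is no distribution, scheduling, or failure handling here.

```python
from collections import defaultdict

def map_reduce(fragments, map_fn, reduce_fn):
    """Run map on each fragment, group pairs by key, reduce once per key."""
    pairs = []
    for fragment in fragments:            # steps 1-2: one map call per fragment
        pairs.extend(map_fn(fragment))
    groups = defaultdict(list)
    for key, value in pairs:              # step 3: group the pairs by key
        groups[key].append(value)
    output = []
    for key in sorted(groups):            # step 4: reduce once for each key
        output.extend(reduce_fn(key, groups[key]))
    return output

def word_count_map(line):
    """Emit (word, 1) for every word in the fragment."""
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    """Sum all the 1s emitted for this word."""
    return [(word, sum(counts))]

result = map_reduce(["to be or", "not to be"], word_count_map, word_count_reduce)
print(result)
```

Since each map call only sees its own fragment and each reduce call only sees one key, a real framework is free to run them on different worker nodes.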
121. Communications
• HDFS
– Hadoop Distributed File System
– input data, temporary results, and results are
stored as files here
– Hadoop takes care of making files available to
nodes
• Hadoop RPC
– how Hadoop communicates between nodes
– used for scheduling tasks, heartbeat etc
• Most of this is in practice hidden from the
developer
122. Does anyone need MapReduce?
• I tried to do book recommendations with
linear algebra
• Basically, doing matrix multiplication to
produce the full user/item matrix with
blanks filled in
• My Mac wound up freezing
• 185,973 books x 77,805 users =
14,469,629,265 matrix cells
– assuming 2 bytes per float = roughly 29 GB of RAM
• So it doesn’t necessarily take that much to
have some use for MapReduce
123. The word count example
• Classic example of using MapReduce
• Takes an input directory of text files
• Processes them to produce word frequency
counts
• To start up, copy data into HDFS
– bin/hadoop dfs -mkdir <hdfs-dir>
– bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
124. WordCount – the mapper
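// needs java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.* and org.apache.hadoop.mapreduce.Mapper
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            context.write(word, one);   // emit (word, 1) for every token
        }
    }
}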
By default, Hadoop will scan all text files in the input directory
Each input split becomes a mapper task
Each line in a split becomes the “Text value” input to one map() call
125. WordCount – the reducer
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values)
            sum += val.get();   // add up the 1s emitted by the mappers
        context.write(key, new IntWritable(sum));
    }
}
126. The Hadoop ecosystem
• Pig
– dataflow language for setting up MR jobs
• HBase
– NoSQL database to store MR input in
• Hive
– SQL-like query language on top of Hadoop
• Mahout
– machine learning library on top of Hadoop
• Hadoop Streaming
– utility for writing mappers and reducers as
command-line tools in other languages
127. Word count in HiveQL
CREATE TABLE input (line STRING);
LOAD DATA LOCAL INPATH 'input.tsv' OVERWRITE INTO TABLE input;
-- temporary table to hold words...
CREATE TABLE words (word STRING);
add file splitter.py;
INSERT OVERWRITE TABLE words
SELECT TRANSFORM(line)
USING 'python splitter.py'
AS word
FROM input;
-- count the words (this query splits the lines itself with explode)
SELECT word, COUNT(*)
FROM input
LATERAL VIEW explode(split(line, ' ')) lTable as word
GROUP BY word;
128. Word count in Pig
input_lines = LOAD '/tmp/my-copy-of-all-pages-on-internet' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/number-of-words-on-internet';
129. Applications of MapReduce
• Linear algebra operations
– easily mapreducible
• SQL queries over heterogeneous data
– basically requires only a mapping to tables
– relational algebra easy to do in MapReduce
• PageRank
– basically one big set of matrix multiplications
– the original application of MapReduce
• Recommendation engines
– the SON algorithm
• ...
130. Apache Mahout
• Has three main application areas
– others are welcome, but this is mainly what’s there
now
• Recommendation engines
– several different similarity measures
– collaborative filtering
– Slope-one algorithm
• Clustering
– k-means and fuzzy k-means
– Latent Dirichlet Allocation
• Classification
– stochastic gradient descent
– Support Vector Machines
– Naïve Bayes
131. SQL to relational algebra
select lives.person_name, city
from works, lives
where company_name = 'FBC' and
      works.person_name = lives.person_name
132. Translation to MapReduce
• σ(company_name=‘FBC’, works)
– map: for each record r in works, verify the condition,
and pass (r, r) if it matches
– reduce: receive (r, r) and pass it on unchanged
• π(person_name, σ(...))
– map: for each record r in input, produce a new record r’
with only wanted columns, pass (r’, r’)
– reduce: receive (r’, [r’, r’, r’ ...]), output (r’, r’)
• ⋈(π(...), lives)
– map:
• for each record r in π(...), output (person_name, r)
• for each record r in lives, output (person_name, r)
– reduce: receive (key, [record, record, ...]), and perform
the actual join
• ...
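The translation can be tried out on toy data with an in-process stand-in for MapReduce. Everything here is an illustrative assumption: the `map_reduce` helper, the tuple layout of the two tables, the sample rows, and the tagging of records by source table for the join.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Single-process stand-in: map each record, group by key, reduce per key."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    output = []
    for key, values in groups.items():
        output.extend(reduce_fn(key, values))
    return output

# Made-up tables: works = (person_name, company_name), lives = (person_name, city)
works = [("alice", "FBC"), ("bob", "ACME"), ("carol", "FBC")]
lives = [("alice", "Oslo"), ("bob", "Bergen"), ("carol", "Oslo")]

# Selection: map passes (r, r) only if the condition holds; reduce passes it on.
selected = map_reduce(
    works,
    lambda r: [(r, r)] if r[1] == "FBC" else [],
    lambda key, values: values,
)

# Projection: map keeps only person_name; reduce emits each distinct r' once.
names = map_reduce(
    selected,
    lambda r: [(r[0], r[0])],
    lambda key, values: [key],
)

# Join: both inputs are keyed by person_name; reduce performs the actual join.
tagged = [("names", n) for n in names] + [("lives", r) for r in lives]

def join_map(tagged_record):
    source, record = tagged_record
    person_name = record if source == "names" else record[0]
    return [(person_name, (source, record))]

def join_reduce(person_name, values):
    if not any(source == "names" for source, _ in values):
        return []  # person does not work at FBC
    return [(person_name, rec[1]) for source, rec in values if source == "lives"]

result = sorted(map_reduce(tagged, join_map, join_reduce))
print(result)
```

The output corresponds to the rows the SQL query on the previous slide would return for these tables.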
133. Lots of SQL-on-MapReduce tools
• Tenzing (Google)
• Hive (Apache Hadoop)
• YSmart (Ohio State)
• SQL-MR (AsterData)
• HadoopDB (Hadapt)
• Polybase (Microsoft)
• RainStor (RainStor Inc.)
• ParAccel (ParAccel Inc.)
• Impala (Cloudera)
• ...
135. Big data & machine learning
• This is a huge field, growing very fast
• Many algorithms and techniques
– can be seen as a giant toolbox with wide-ranging
applications
• Ranging from the very simple to the
extremely sophisticated
• Difficult to see the big picture
• Huge range of applications
• Math skills are crucial