The document is an agenda for a seminar on machine learning techniques and tools. It covers an introduction to machine learning and common techniques such as classification, clustering and regression. It also discusses machine learning tools such as Apache Mahout, Weka, Spark MLlib and R. Finally, it includes a hands-on demonstration of machine learning algorithms and discusses the benefits of using machine learning.
1. Data Science Company
Machine Learning in Practice
An InfoFarm Seminar
Veldkant 33A, Kontich ● info@infofarm.be ● www.infofarm.be
2. Data Science, Big Data
Identifying, extracting and using data of all types
and origins; exploring, correlating and using it in new
and innovative ways in order to extract meaning
and business value from it.
3. 2 Data Scientists, 4 Big Data Consultants, 1 Infrastructure Specialist
4. Java, PHP, E-Commerce, Mobile, Web Development
9. Machine Learning is a subfield of
computer science and statistics that deals
with systems that can learn from data,
instead of following explicitly programmed
instructions.
10. Machine Learning vs Data Science vs Big Data
• You don’t need Big Data to leverage the
benefits of machine learning, but more
training data usually makes for a better model
• Data Science can help you to get the most
out of Machine Learning
• Machine Learning can help you to get the
most out of Data Science
12. Terminology
Weight (g)  Wingspan (cm)  Webbed feet?  Back color  Species
1000.1      125.0          No            Brown       Buteo jamaicensis
3000.7      200.0          No            Gray        Sagittarius serpentarius
3300.0      220.3          No            Gray        Sagittarius serpentarius
4100.0      136.0          Yes           Black       Gavia immer
3.0         11.0           No            Green       Calothorax lucifer
570.0       75.0           No            Black       Campephilus principalis
• Features / attributes
• Instance / data point
• Label / target variable
• Categorical (factor) versus numeric versus binary data
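To make the terminology concrete, here is a minimal Python sketch that represents a few rows of the bird table as instances, splits each one into features and a label, and names the kind of data each feature holds. The key names (weight_g, wingspan_cm and so on) and both helper functions are assumptions made for illustration; they do not come from any of the tools discussed in the seminar.

```python
# A few rows of the bird table as Python data. The key names
# (weight_g, wingspan_cm, ...) are illustrative assumptions.
FEATURES = ["weight_g", "wingspan_cm", "webbed_feet", "back_color"]
LABEL = "species"

instances = [
    {"weight_g": 3000.7, "wingspan_cm": 200.0, "webbed_feet": False,
     "back_color": "Gray", "species": "Sagittarius serpentarius"},
    {"weight_g": 3300.0, "wingspan_cm": 220.3, "webbed_feet": False,
     "back_color": "Gray", "species": "Sagittarius serpentarius"},
    {"weight_g": 4100.0, "wingspan_cm": 136.0, "webbed_feet": True,
     "back_color": "Black", "species": "Gavia immer"},
]


def split_features_label(instance):
    """Separate one instance (a row) into its features and its label."""
    features = {name: instance[name] for name in FEATURES}
    return features, instance[LABEL]


def feature_kind(value):
    """Name the kind of data a feature value holds."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return "binary"
    if isinstance(value, (int, float)):
        return "numeric"
    return "categorical"
```

Each dictionary is one instance (data point), the first four keys are its features (attributes), and the species is the label (target variable) an algorithm would try to predict.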
21. Association Rule Learning: Use Cases
• Recommendations
• Data exploration
• Find connections between unrelated
events
• Frequent pattern mining
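As a rough illustration of frequent pattern mining, the sketch below computes two measures commonly used in association rule learning: the support of item pairs and the confidence of a rule "A implies B". The function names are assumptions, and the brute-force pair counting is a simplification for small data, not the algorithm used by any of the tools discussed later.

```python
from collections import Counter
from itertools import combinations


def pair_support(transactions):
    """Return the fraction of transactions that contain each item pair."""
    counts = Counter()
    for items in transactions:
        # sorted(set(...)) gives every pair one canonical order
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    total = len(transactions)
    return {pair: count / total for pair, count in counts.items()}


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the share that also
    contain consequent."""
    with_antecedent = [t for t in transactions if antecedent in t]
    if not with_antecedent:
        return 0.0
    hits = sum(1 for t in with_antecedent if consequent in t)
    return hits / len(with_antecedent)
```

For a recommendation use case, rules with high support and high confidence ("customers who bought A also bought B") would be the candidates to show to a user.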
22. Regression
• Prediction of a quantity
• Algorithms:
– Linear regression
– Logistic regression (predicts a class probability, mostly used for classification)
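As a minimal sketch of linear regression, the code below fits a straight line through one-dimensional data with ordinary least squares and uses it to predict a quantity. The function names fit_line and predict are assumptions for illustration; real projects would normally rely on one of the libraries discussed later.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Predict the quantity for a new x using the fitted line."""
    slope, intercept = model
    return slope * x + intercept
```

With weekly order quantities as ys and week numbers as xs, predict(model, next_week) would give a simple trend-based estimate for the next order.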
23. Regression: Use Cases
• Order Quantity Prediction
• Lag analysis
• Trend estimation
24. Information Extraction
• Extract variables from unstructured data
such as text.
• Named Entity Extraction
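A very small sketch of the idea, assuming that simple patterns are enough: the code below pulls e-mail addresses and measured quantities out of free text with regular expressions. The patterns and the function name extract_variables are illustrative assumptions; real named entity extraction usually relies on trained models rather than hand-written patterns.

```python
import re

# Simplified patterns, good enough for illustration only
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
QUANTITY = re.compile(r"(\d+(?:\.\d+)?)\s*(kg|g|cm)\b")


def extract_variables(text):
    """Turn free text into a small structured record."""
    return {
        "emails": EMAIL.findall(text),
        "quantities": [(float(value), unit)
                       for value, unit in QUANTITY.findall(text)],
    }
```

The resulting record can then be used as features for the other techniques, which expect structured input.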
27. Apache Mahout
Pro
• Relatively stable
• Built on Hadoop: scales well
• Command-line access for most algorithms
• All important algorithms are available
Contra
• Poor documentation
• Mahout is currently migrating from Apache Hadoop to Apache Spark.
Development is slow and Apache Spark already built a machine learning
library of its own… Instant legacy?
• Kind of slow for smaller use cases
29. Weka
Pro
• A lot of algorithms are available
• Graphical user interface for prototyping and experimenting
• Available as a Java library
Contra
• Not ‘Big Data’ ready
• Requires a custom data format: ARFF files
• Optimized for academic use cases
31. Apache Spark: MLlib
Pro
• Based on Apache Spark: very fast and scalable
• Very fast development cycle, new features roll out every couple of months
• New and refreshing API, easy integration with other components of
Apache Spark
Contra
• Based on Apache Spark: requires knowledge of Spark and Scala
• Relatively new, so a small choice of algorithms, but the essential ones
are there
33. R
Pro
• A lot of algorithms are available
• Well documented
• Lots of existing packages that are easily available
Contra
• Can run on Hadoop/Spark, but requires a lot of knowledge of both platforms
• Must learn a new language
35. Integration with Software Development
36. Development Cycle
Collect → Analyze → Extract → Train → Test → Use
37. Feature extraction
• Describe an instance to be
used in an algorithm
• Recognize hand-written digits
by converting the images to
lines of 1’s and 0’s
00000000000001000000000000000000
00000000000001110000000000000000
00000000000011110000000000000000
00000000001111100000000000000000
00000000001111000000000000000000
00000000000111100000000000000000
00000000001111100000000000000000
00000000011111000000000000000000
00000000011110000000000000000000
00000000111110000000000000000000
00000000011111000000000000000000
00000000111111000000000000000000
00000000111110000000000000000000
00000000111100000000000000000000
00000000011110000000000000000000
00000000111110000111000000000000
00000001111111111111111100000000
00000001111111111111111110000000
00000001111111111111111110000000
00000000111111111111111111100000
00000001111111110000011111100000
00000001111100000000000111100000
00000000111100000000000111100000
00000000011110000000000011110000
00000000011111000000000011110000
00000000011111100000001111110000
00000000011111111111111111110000
00000000011111111111111111100000
00000000000111111111111111100000
00000000000011111111111111100000
00000000000000111111111000000000
00000000000000001111110000000000
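The conversion described on this slide can be sketched as follows: each text line of 1's and 0's is read character by character and appended to one flat feature vector, so a 32 by 32 image becomes a vector of 1024 numbers. The function name lines_to_vector is an assumption for illustration.

```python
def lines_to_vector(lines):
    """Flatten an image stored as lines of '0'/'1' characters into one
    list of integers, row after row."""
    vector = []
    for line in lines:
        row = line.strip()
        if not row:  # skip blank lines
            continue
        vector.extend(int(ch) for ch in row)
    return vector
```

Every position in the vector is one feature of the instance; the digit the image represents would be its label.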
38. Training an algorithm
1. Collect your data as a collection of
instances
2. Split your data set into a training set
and a testing set
3. Train the algorithm with the training set
4. Validate the results using the test set
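The four steps above can be sketched in plain Python. The example uses a 1-nearest-neighbour classifier because it is short to write; this choice, the function names and the 25% test fraction are all assumptions for illustration, not the method used in the seminar demonstration. Each instance is a (features, label) pair with numeric features.

```python
import random


def train_test_split(instances, test_fraction=0.25, seed=0):
    """Step 2: shuffle the instances and split off a test set."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train(training_set):
    """Step 3: for 1-nearest-neighbour, the model is the stored data."""
    return list(training_set)


def classify(model, features):
    """Return the label of the stored instance closest to features."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(model, key=lambda inst: squared_distance(inst[0], features))
    return nearest[1]


def accuracy(model, test_set):
    """Step 4: share of test instances that are classified correctly."""
    hits = sum(1 for features, label in test_set
               if classify(model, features) == label)
    return hits / len(test_set)
```

Keeping the test set out of training matters: accuracy measured on data the algorithm has already seen would overstate how well it handles new instances.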
39. Runtime model
• During training most algorithms generate a
mathematical runtime model.
• Model should be updated on a regular
basis
40. A / B Testing
• Gradual integration into the main system.
• If the machine is certain (enough), it
can take over
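One way to picture this handover, as a sketch under stated assumptions: the machine's prediction is only used when its confidence reaches a threshold, otherwise the existing (human) decision stands, and the two are compared over time. The function names, the 0.9 threshold and the idea of a numeric confidence score are assumptions for illustration.

```python
def decide(prediction, confidence, human_decision, threshold=0.9):
    """Use the machine's prediction only when it is certain enough."""
    if confidence >= threshold:
        return prediction, "machine"
    return human_decision, "human"


def agreement_rate(pairs):
    """Share of cases in which the machine agreed with the human.

    pairs is a list of (machine_prediction, human_decision) tuples
    collected while both run side by side.
    """
    if not pairs:
        return 0.0
    return sum(1 for machine, human in pairs if machine == human) / len(pairs)
```

A high agreement rate during the side-by-side phase is one signal that the threshold can be lowered and more decisions handed to the machine.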
43. What’s in it for you?
44. Benefits of using machine learning
• Automate repetitive tasks
• Can be a solution for problems that are
difficult to automate
• Gain insights about your business
• Optimize business decisions by using the
opinion of the computer