Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

- AI and Machine Learning Demystified... by Carol Smith 3107184 views
- The AI Rush by Jean-Baptiste Dumont 808369 views
- 10 facts about jobs in the future by Pew Research Cent... 463294 views
- 2017 holiday survey: An annual anal... by Deloitte United S... 759284 views
- Harry Surden - Artificial Intellige... by Harry Surden 420473 views
- Inside Google's Numbers in 2017 by Rand Fishkin 978007 views

357 views

Published on

Hear how, in order to bin continuous columns, ESI Group employs the open source Minimum Description Length Principal (MDLP) library developed by Sergio Ramirez based on a paper by Fayyad and Irani. It is capable of using Spark to do entropy-based binning on many columns simultaneously with respect to a specific target. Once continuous columns have been binned and indexed, they can be processed by Spark’s Naive Bayes algorithm.

Published in:
Data & Analytics

No Downloads

Total views

357

On SlideShare

0

From Embeds

0

Number of Embeds

1

Shares

0

Downloads

26

Comments

0

Likes

1

No embeds

No notes for slide

- 1. Barry Becker, ESI-Group VISUALIZATION OF ENHANCED SPARK INDUCED NAIVE BAYES CLASSIFIER
- 2. Overview • What the Naïve Bayes Classifier is • How it can be visualized, and why you would want to • Limitations of the version in spark, how to get around them • Demo and use cases
- 3. What is a Naïve Bayes Classifier? • A probabilistic model that can be used to predict a categorical value (the target) • Induced using training data and evaluated with test data Inducer Classifier Training Data New Data Predictions
- 4. What is a Naïve Bayes Classifier? • Advantages – Fast and simple – Easy to visualize • Disadvantage – Assumes features are independent
- 5. Use Case: Detecting Poisonous Mushrooms Expert determines edibility based on cap shape, odor, gill spacing, stalk root, veil type, veil color, ring type, spore print color, habitat, gill size, stalk shape, … many more…. for 8,124 cases https://www.pinterest.com/pin/239253798926772284
- 6. Bayes’ Rule P(Edible | X) = P(X | Edible) P(Edible) / P(X) Where X is Odor = none and Spore print color = chocolate and Ring type = evanescent Bayes Rule Demo
- 7. Understand model with Visualization Everything you see is based on counts
- 8. How the Model Makes Predictions Multiplies conditional probabilities to come up with a posterior probability
- 9. How the Model Makes Predictions Multiplies conditional probabilities to come up with a posterior probability
- 10. Limitations of Spark Naïve Bayes Out-of-the-box version is only for document classification! Document Classification Approach • Each feature represents a word • For each feature, determine frequency for each class More General Approach • Each feature is a column • For each value of each feature, determine class distribution =* …** Prediction: Not Spam loan viagra zebra =** Prediction: Poisonousevanescent none chocolate Ring Type: Odor: SP Color: spam not spam edible poisonous
- 11. Limitations of Spark Naïve Bayes All columns must be non-negative integers – But what if there are nulls? – But what if there are strings? – But what if there are floating point values? – But what if there are dates? – What if target is continuous?
- 12. How to Handle String Columns? • First replace Nulls with a special Null value • Use StringIndexers to map string values to integer indices • Lastly, use IndexToString transformer to restore the predicted value back to a string.
- 13. How to Handle Continuous Columns? • Use MDLP Discretization to intelligently bin continuous (number or date) columns with respect to the target – Entropy based binning makes distributions of adjacent bins as different as possible • Replace Nulls with NaN so they can be in a separate bin.
- 14. Example: Job Type from Application
- 15. MDLP applied to a continuous column MDLP Binning 30 Uniform Bins HistogramView
- 16. MDLP applied to continuous column MDLP Binning 30 Uniform Bins SpinogramView
- 17. MDLP Binning of Continuous Features Splits seek to maximize the information, as measured by entropy Recursively split bins until – Information gain is too small – Bins are getting too small (avoids overfitting) Open Source https://github.com/sramirez/spark-MDLP-discretization
- 18. What if continuous Target? Automatically bin a continuous target into 3 equal weight bins
- 19. Mineset Demo
- 20. Pixel-Perfect Rendering Anti-aliased rendering Pixel-perfect rendering
- 21. Pixel-Perfect Rendering Anti-aliased rendering Pixel-perfect rendering
- 22. Conclusion • Spark Naïve Bayes classifier can be applied to any data as long as some modifications made • MDLP is useful for binning continuous features • Visualization of model provides insight, trust, and ability to answer what If questions
- 23. https://cloud.esi-group.com/analytics Barry Becker - bbe@esi-group.com

No public clipboards found for this slide

Be the first to comment