
Data Pre-Processing with Spark and Cassandra for Machine Learning



Join us as we walk through the first real step on the road to a complete machine learning model: dealing with our data. We will learn how to load data into our training environment, as well as some of the different ways that data may have to be manipulated before we can use it to train our models.


Published in: Data & Analytics


  1. Data Pre-Processing with Spark and Cassandra for Machine Learning (Version 1.0). An Anant Corporation Story.
  2. Introduction
     ● What is data pre-processing and why do we do it?
     ● A brief look at different types of pre-processing.
     ● Demo with Cassandra and Spark.
  3. Pre-Processing Overview
     ● A set of transformations applied to data that prepares it for use with machine learning algorithms.
     ● Data comes in completely raw: fields in various types, missing values, redundant fields.
     ● Pre-processing also gives us insight into our data.
     ● Algorithms require data to be in specific formats; raw data will not meet the conditions for the algorithm.
  4. Types of Pre-Processing
     ● Vectorization
     ● Imputation
     ● Standardization
     ● Encoding
     ● PCA
  5. Vectorization
     ● Many machine learning algorithms require that data be fed in as a single vector.
     ● Data in separate fields in a Spark DataFrame does not fulfill this need.
     ● Spark provides a VectorAssembler class, which takes specified columns and creates a single vector containing them.
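As a rough plain-Python sketch of what the assembler step does conceptually (Spark's real class is `VectorAssembler` in `pyspark.ml.feature`; the column names and rows below are invented for illustration):

```python
def assemble_vectors(rows, input_cols):
    """Combine the named fields of each row into a single feature vector."""
    return [[row[col] for col in input_cols] for row in rows]

# Hypothetical rows standing in for a Spark DataFrame.
rows = [
    {"age": 34, "height_cm": 170.0, "weight_kg": 65.5},
    {"age": 28, "height_cm": 182.0, "weight_kg": 77.0},
]
features = assemble_vectors(rows, ["age", "height_cm", "weight_kg"])
# Each row is now one vector, ready to feed to an algorithm.
```

In Spark the equivalent call would name `inputCols` and an `outputCol`, producing a new vector-typed column rather than a Python list.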
  6. Imputation
     ● A method for dealing with missing data, as opposed to simply dropping any rows with missing fields.
     ● Mean imputation: replace the missing data with the mean of all existing examples of that value. Other statistics can stand in for the mean, such as the median, lowest, highest, or a random value.
     ● ML imputation: train a model with our existing data to predict a value for the missing field. If multiple fields are missing, we need a model for each.
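A minimal sketch of mean imputation in plain Python (Spark ships its own `Imputer` transformer; the sample values here are made up):

```python
def mean_impute(values):
    """Replace missing (None) entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

ages = [25, None, 35, None, 40]
filled = mean_impute(ages)  # both gaps become the mean of 25, 35, 40
```

Swapping `mean` for the median, minimum, maximum, or a random draw gives the other variants mentioned above without changing the structure of the function.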
  7. Standardization
     ● The process of centering your data around a specific value and scaling it to within a certain range.
     ● Some machine learning algorithms work best with data between zero and one; some need data centered around zero.
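Both cases on this slide can be sketched in a few lines of plain Python: min-max scaling for the zero-to-one case, and z-scoring for centering around zero (Spark's counterparts are `MinMaxScaler` and `StandardScaler`):

```python
def min_max_scale(values):
    """Scale values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Center values around zero and scale by the standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [(v - mean) / std for v in values]
```

One practical note: whichever scaling you fit on the training data (its min/max or mean/std), the same parameters must be reused on test data, which is why Spark models these as fitted transformers.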
  8. Encoding
     ● Binary/Categorical
       ○ Binary encoding turns fields with only two possible values into integers with the value of either 0 or 1.
       ○ Categorical encoding works for fields with greater numbers of possible values, as long as the number is finite: each possible value becomes a unique integer. Works well on string fields.
     ● One-Hot
       ○ Another method of categorical encoding where each possible value becomes a specific position in a string of binary digits.
       ○ Some algorithms prefer one-hot encoded values to categorical integer values.
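A small plain-Python sketch of both encodings (in Spark these correspond to `StringIndexer` and `OneHotEncoder`; the color values are invented, and sorting the distinct values here is just one way to fix a stable index order):

```python
def categorical_encode(values):
    """Map each distinct value to a unique integer index."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

def one_hot_encode(values):
    """Turn each value into a vector with a 1 in that value's slot."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    size = len(mapping)
    return [[1 if mapping[v] == i else 0 for i in range(size)] for v in values]

colors = ["red", "green", "red", "blue"]
indices, mapping = categorical_encode(colors)
one_hot = one_hot_encode(colors)
```

One-hot avoids implying a false ordering: integer codes suggest "red > green > blue" to distance-based algorithms, while one-hot vectors keep the categories equidistant.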
  9. PCA
     ● Principal Component Analysis is a method of data processing meant to cut down on the number of fields being fed into a machine learning algorithm.
     ● It is a method of analysis as well, telling us which fields in our data are more or less correlated with our label.
     ● It transforms our data, rotating our axes until we have the number that we desire, in a combination that explains the most variance in our data.
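The rotation described above can be sketched with numpy as an eigendecomposition of the covariance matrix (Spark has its own `pyspark.ml.feature.PCA`; this is just the underlying math, with a made-up 2-D dataset):

```python
import numpy as np

def pca(X, n_components):
    """Project X onto its top principal components (directions of most variance)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # largest-variance directions first
    components = eigvecs[:, order[:n_components]]
    return X_centered @ components           # rotate and drop the rest

X = np.array([[2.0, 0.0], [0.0, 2.0], [3.0, 1.0], [1.0, 3.0]])
reduced = pca(X, 1)  # four 2-D points collapsed onto one axis
```

Because the data is centered before projection, the reduced features are themselves zero-mean, which pairs naturally with the standardization step earlier in the deck.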
  10. Demo
  11. Strategy: Scalable Fast Data
      Architecture: Cassandra, Spark, Kafka
      Engineering: Node, Python, JVM, CLR
      Operations: Cloud, Container
      Rescue: Downtime!! I need help.
      (855) 262-6826 | 3 Washington Circle, NW | Suite 301 | Washington, DC 20037
  12. Questions?