Page 1 – Big Data Industry Process – Adil ZEAARAOUI
Big Data Industry Process
Definition:
The big data process is a set of activities (business understanding, data collection, data
exploration, data preprocessing, data mining, model evaluation, and deployment) carried out
together in order to extract hidden information from a large mass of data.
Fig.1: General overview of big data process
Big data process activities:
Based on my experience in data science, I summarize the big data process in the
following steps:
Step1: Understand the business
In this step, we aim to:
• Define the problem and its scope clearly
• Have a clear view of the goal
• Draw the path to the objective
Step2: Collect the data
Import and collect the data from the different sources, such as relational databases (RDBMS),
data lake stores, data warehouses, etc.
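A minimal sketch of this step, assuming two hypothetical sources: an in-memory SQLite database standing in for a relational database, and a CSV export standing in for a file pulled from a data lake. The table name `customers`, the column names, and all values are invented for illustration. Only the Python standard library is used here; a real project would likely rely on dedicated connectors.

```python
import csv
import io
import sqlite3

# Hypothetical source 1: a relational database (in-memory SQLite stands in for an RDBMS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, city TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Rabat"), (2, "Lyon")])
conn.commit()

# Hypothetical source 2: a CSV export, e.g. a file pulled from a data lake.
CSV_EXPORT = "id,amount\n1,120.5\n2,80.0\n1,15.25\n"


def load_customers(connection):
    """Read the relational source into a dict keyed by customer id."""
    rows = connection.execute("SELECT id, city FROM customers").fetchall()
    return {cid: {"id": cid, "city": city} for cid, city in rows}


def load_orders(text):
    """Parse the CSV source into a list of typed records."""
    reader = csv.DictReader(io.StringIO(text))
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in reader]


def collect(connection, text):
    """Join both sources on the customer id into one flat dataset."""
    customers = load_customers(connection)
    return [{**customers[o["id"]], "amount": o["amount"]} for o in load_orders(text)]


dataset = collect(conn, CSV_EXPORT)
for record in dataset:
    print(record)
```

Joining on a shared key right after loading keeps the later steps working on a single flat dataset, whatever the number of original sources.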
Step3: Understand and explore data
Before any kind of development, we must first explore our dataset. The exploration
consists of:
• Exploring the features
• Distinguishing categorical features from numerical ones
• Doing statistical analysis: min, max, mean, standard deviation, variance, etc.
• Visualizing the data: missing values for each feature, unique values, how values are
distributed, etc.
• Identifying the features that matter to the business
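The exploration tasks listed above can be sketched as follows. This is an illustrative example on a hypothetical toy dataset (the feature names `age`, `city`, and `income` are assumptions), using only the standard library; it reports the kind of each feature, its missing and unique value counts, and basic statistics for numerical features.

```python
import statistics

# Hypothetical toy dataset; None marks a missing value.
rows = [
    {"age": 25, "city": "Rabat", "income": 3000.0},
    {"age": 40, "city": "Lyon", "income": None},
    {"age": 31, "city": "Rabat", "income": 4200.0},
    {"age": None, "city": "Paris", "income": 3900.0},
]


def explore(records):
    """Return a per-feature summary: kind, missing count, unique count, basic statistics."""
    summary = {}
    for name in records[0]:
        values = [r[name] for r in records]
        present = [v for v in values if v is not None]
        numeric = all(isinstance(v, (int, float)) for v in present)
        info = {
            "kind": "numerical" if numeric else "categorical",
            "missing": len(values) - len(present),
            "unique": len(set(present)),
        }
        if numeric:
            info.update(
                min=min(present),
                max=max(present),
                mean=statistics.mean(present),
                stdev=statistics.stdev(present),  # sample standard deviation
                variance=statistics.variance(present),  # sample variance
            )
        summary[name] = info
    return summary


report = explore(rows)
for feature, info in report.items():
    print(feature, info)
```

The sketch assumes every feature has at least two observed values; a production version would need to guard against empty or single-value columns.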
Step4: Pre-process data
This is the most important step in the big data process; it can take up to 90% of the
overall effort. Its purpose is to prepare the data before mining it. Typical tasks are:
• Correct wrong input values
• Remove missing values (for example, records or features that are mostly empty)
• Fill the remaining missing values
• Discretize continuous features
• Remove highly correlated features
• Normalize features if required
• Remove outliers if necessary
• Etc.
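A few of the tasks above can be sketched on a single numerical feature. The example below is only one possible set of choices, all of them assumptions rather than the author's method: median imputation for missing values, Tukey's IQR rule for outliers, min-max scaling for normalization, and fixed bin edges for discretization. The values, bin edges, and labels are hypothetical.

```python
import bisect
import statistics

# Hypothetical raw values for one numerical feature; None marks a missing value.
raw_ages = [25, 40, None, 31, 29, 250, 35]


def impute_median(values):
    """Fill missing entries with the median of the observed values."""
    median = statistics.median([v for v in values if v is not None])
    return [median if v is None else v for v in values]


def remove_outliers_iqr(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretize(values, edges, labels):
    """Map each value to a bin label; edges are the sorted lower bounds of bins 2..n."""
    return [labels[bisect.bisect_right(edges, v)] for v in values]


filled = impute_median(raw_ages)
cleaned = remove_outliers_iqr(filled)
normalized = min_max_normalize(cleaned)
bins = discretize(cleaned, edges=[30, 36], labels=["young", "middle", "senior"])

print(cleaned)
print(bins)
```

The order matters in this sketch: outliers are removed before normalization, since a single extreme value such as 250 would otherwise compress all other values toward 0.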
Step5: Develop your model (Data mining)
After building a clean and “ready to process” dataset, it is time to build our model.
• Transform our dataset if required
• Apply our machine-learning algorithm
Step6: Evaluate and deploy the model
Before deployment, we must validate the model and measure how accurate it is. So we must:
• Evaluate and test the model
• Review and enhance it
• Deploy the model
• Automate the system workflow
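To illustrate model development and evaluation together, here is a minimal sketch, assuming a synthetic two-class dataset and a nearest-centroid classifier. Both are chosen only for brevity and are not the algorithm the author has in mind; the function names are hypothetical. The key point is that accuracy is measured on held-out data the model never saw during training.

```python
import random

random.seed(0)  # fixed seed so the data and the split are reproducible


def make_dataset(n_per_class=50):
    """Generate hypothetical 2-D points: class 0 around (0, 0), class 1 around (3, 3)."""
    data = []
    for label, center in ((0, 0.0), (1, 3.0)):
        for _ in range(n_per_class):
            point = (random.gauss(center, 1.0), random.gauss(center, 1.0))
            data.append((point, label))
    return data


def train_test_split(data, test_ratio=0.3):
    """Shuffle a copy of the data and hold out a fraction for testing."""
    shuffled = data[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_ratio))
    return shuffled[:cut], shuffled[cut:]


def fit_centroids(train):
    """Training: compute the mean point of each class."""
    centroids = {}
    for label in {lbl for _, lbl in train}:
        points = [p for p, lbl in train if lbl == label]
        centroids[label] = tuple(sum(c) / len(points) for c in zip(*points))
    return centroids


def predict(centroids, point):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def distance(label):
        return sum((a - b) ** 2 for a, b in zip(point, centroids[label]))
    return min(centroids, key=distance)


def accuracy(centroids, test):
    """Share of held-out points classified correctly."""
    hits = sum(predict(centroids, p) == lbl for p, lbl in test)
    return hits / len(test)


train, test = train_test_split(make_dataset())
model = fit_centroids(train)
print(f"held-out accuracy: {accuracy(model, test):.2f}")
```

If the held-out accuracy is unsatisfactory, the review-and-enhance loop returns to preprocessing or model choice before any deployment takes place.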
