How to Deal With Major Challenges in Machine Learning

Written by siddhesh | Published 2020/03/14
Tech Story Tags: machine-learning | data-science | python | programming | code | deployment-challenges | model-tuning | ml

TLDR We often get blocked at different steps while working on a machine learning problem. This article lists the major challenges we face and the steps we can take to overcome them, categorised into three sub-domains for easier understanding: Data Preparation, Model Training and Model Deployment.

We often get blocked at different steps while working on a machine learning problem. To solve almost all of these blockers, I have listed the major challenges we face and the steps we can take to overcome them. I have also categorised these challenges into different sub-domains for easier understanding, namely Data Preparation, Model Training and Model Deployment.

Data Preparation

Data collection: 

  1. Incomplete data is a common headache when we start collecting data, and even the data we do get often turns out to be biased. Bias is any deviation from the truth in data collection or data analysis that can cause false conclusions.
  2. Then comes the curse of dimensionality, which refers to phenomena that occur when analyzing high-dimensional data but do not occur in low-dimensional spaces.
  3. Finally, we have the data sparsity problem. Imagine a table with lots of null or impossible values; these values represent the sparsity in your data.
Steps to overcome:
  1. Dedicate proper time to understanding the problem and the datasets you need to solve it
  2. Enrich the data
  3. Apply dimension-reduction techniques (see the sketch below)
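
A minimal sketch of the dimension-reduction step, assuming scikit-learn's PCA and toy random data in place of a real dataset:

```python
# Reduce 50 features to the 10 strongest principal components.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 50)                  # 100 samples, 50 features (toy data)
pca = PCA(n_components=10)                   # keep 10 components
X_reduced = pca.fit_transform(X)             # shape: (100, 10)
print(pca.explained_variance_ratio_.sum())   # fraction of variance retained
```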

Outliers:

  1. Out-of-range numerical values or unknown categorical values in our data
  2. Outliers have a drastic influence on squared loss functions
Steps to overcome:
  1. Discretization techniques like binning can help in reducing the influence of outliers on the squared loss
  2. Robust methods like the Huber loss function (see the sketch below)
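
As a sketch of the robust-loss remedy, scikit-learn's HuberRegressor down-weights large residuals where ordinary least squares chases them; the data below is a toy example with injected outliers:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 3 * X.ravel() + rng.normal(scale=0.1, size=100)
y[:5] += 10                             # inject a few extreme outliers

huber = HuberRegressor().fit(X, y)      # robust: outliers get reduced weight
ols = LinearRegression().fit(X, y)      # squared loss: pulled towards outliers
print(huber.coef_, ols.coef_)           # Huber's slope stays near the true 3
```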

Missing Data:

  1. Missing data results in information loss and therefore hurts the model's accuracy
  2. It can also cause information bias, which happens when key information is measured, collected, or interpreted inaccurately
Steps to overcome:
  1. Tree-based modelling techniques can deal with missing values natively
  2. Discretization can also help here in limiting the impact on the loss function
  3. Imputation (see the sketch below)
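
A minimal imputation sketch, assuming scikit-learn's SimpleImputer and a small toy array:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # or "median", "most_frequent"
X_filled = imputer.fit_transform(X)        # NaNs replaced by column means
print(X_filled)
```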

Sparse target variables:

  1. This happens when the primary event occurs at a very low rate
  2. There is an overwhelming preponderance of zero or missing values in the target
Steps to overcome:
  1. Proportional oversampling (see the sketch below)
  2. Mixture models
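
One simple way to oversample a rare positive class is sklearn.utils.resample; the sample counts and the target proportion below are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.utils import resample

X = np.random.rand(1000, 5)                # toy features
y = np.zeros(1000)
y[:20] = 1                                 # only 2% positive events

X_pos, X_neg = X[y == 1], X[y == 0]
# Resample the rare class with replacement up to a workable proportion.
X_pos_up = resample(X_pos, replace=True, n_samples=300, random_state=0)

X_bal = np.vstack([X_neg, X_pos_up])
y_bal = np.hstack([np.zeros(len(X_neg)), np.ones(len(X_pos_up))])
```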

Model Training

Overfitting:

  1. The main reason behind overfitting is high variance and low bias: the model fits the training data too closely and fails to generalize properly
Steps to overcome:
  1. Regularization - a technique for tuning the model by adding a penalty term to the error function
  2. Noise injection - artificially adding "noise" to the input data during the training process
  3. Cross-validation - a technique for assessing how well the results of a statistical analysis generalize to an independent data set (see the sketch below)
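
A minimal sketch combining two of these remedies: L2 regularization (Ridge) assessed with 5-fold cross-validation, using scikit-learn's built-in diabetes toy dataset:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
model = Ridge(alpha=1.0)                      # alpha controls penalty strength
scores = cross_val_score(model, X, y, cv=5)   # one R^2 score per fold
print(scores.mean())                          # generalization estimate
```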

Computational resource exploitation:

  1. Most of the time, we rely on single-threaded algorithm implementations
  2. Heavy reliance on interpreted languages
Steps to overcome:
  1. Train many single-threaded models in parallel (see the sketch below)
  2. Hardware acceleration, for example GPUs and SSDs
  3. Low-level native libraries
  4. Cloud resources, for example Google Colab notebooks
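
A minimal sketch of the first remedy, training several single-threaded models in parallel with joblib (the per-seed decision tree is just a toy example):

```python
from joblib import Parallel, delayed
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

def fit_tree(seed):
    # Each call is an independent, single-threaded fit.
    return DecisionTreeClassifier(random_state=seed).fit(X, y)

# n_jobs=-1 spreads the fits across all available CPU cores.
models = Parallel(n_jobs=-1)(delayed(fit_tree)(s) for s in range(8))
```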

Ensemble models:

  1. A single model sometimes fails to provide adequate accuracy
  2. A single model is also more prone to overfitting - high variance and low bias that fail to generalize properly
Steps to overcome:
  1. Ensemble methods like bagging, boosting and stacking can help overcome the problem (see the sketch below)
  2. Custom or manual combinations of predictions sometimes help in achieving better accuracy
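
A minimal sketch of bagging and stacking with scikit-learn's built-in ensemble classes, on the iris toy dataset:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Bagging: many trees trained on bootstrap samples, predictions averaged.
bagging = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50).fit(X, y)

# Stacking: base models feed their predictions into a final meta-model.
stacking = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier()),
                ("lr", LogisticRegression(max_iter=1000))],
    final_estimator=LogisticRegression(),
).fit(X, y)
```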

Hyper parameter tuning:

  1. Combinatorial explosion, the rapid growth of a problem's complexity as its input grows, happens with hyperparameters in conventional algorithms: each additional hyperparameter multiplies the number of combinations to evaluate.
Steps to overcome:
  1. Local search optimization, which includes genetic algorithms
  2. Grid search or random search techniques help in finding the best combination of hyperparameters from the values we feed in (see the sketch below)
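
A minimal grid search sketch with scikit-learn; RandomizedSearchCV is the random-search counterpart, and the grid values below are arbitrary choices for illustration:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Every combination of these values is evaluated with 5-fold CV.
grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]}
search = GridSearchCV(SVC(), grid, cv=5).fit(X, y)
print(search.best_params_, search.best_score_)
```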

Model Interpretation:

  1. A large number of parameters and rules makes it difficult to interpret the model
Steps to overcome:
  1. Variable selection using regularization techniques
  2. Surrogate models
  3. Interpretation methods like LIME
  4. Partial dependence plots and feature importance graphs can assist in interpreting the model (see the sketch below)
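
A minimal sketch of two of these aids, permutation feature importance and a partial dependence plot, using scikit-learn's inspection module (available in recent versions; the plot also requires matplotlib, and LIME lives in its own `lime` package):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay, permutation_importance

X, y = load_diabetes(return_X_y=True)
model = RandomForestRegressor(random_state=0).fit(X, y)

# How much does shuffling each feature hurt the score?
result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
print(result.importances_mean)                         # importance per feature

# Average predicted outcome as feature 2 varies (needs matplotlib).
PartialDependenceDisplay.from_estimator(model, X, features=[2])
```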

Model Deployment

Model deployment:

  1. The trained model's logic must be moved from the development environment to an operational computing system to assist the organization in making decisions
Steps to overcome:
  1. Web-service scoring makes the results available to people over the network (see the sketch below)
  2. Dashboards of the model's output are easier for any organization to understand
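
A minimal web-service scoring sketch with Flask; model.pkl is a hypothetical file containing a previously trained, pickled model:

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the trained model once at startup (hypothetical file name).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expects JSON like {"features": [[5.1, 3.5, 1.4, 0.2]]}.
    features = request.json["features"]
    return jsonify(prediction=model.predict(features).tolist())

if __name__ == "__main__":
    app.run(port=5000)
```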

Model decay:

  1. The business problem and market conditions may have changed since the model was created
  2. New observations may fall outside the domain of the training data
Steps to overcome:
  1. Regularly monitor the model, especially when its accuracy decreases (a simple drift check is sketched below)
  2. Update the model regularly whenever there are changes in the data or system affecting it
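
One simple way to monitor for such drift, not from the original article but a common choice, is a two-sample Kolmogorov-Smirnov test from SciPy comparing a feature's training distribution against live data; the arrays and the 0.05 threshold below are illustrative assumptions:

```python
import numpy as np
from scipy.stats import ks_2samp

train_feature = np.random.normal(0.0, 1.0, 1000)   # stand-in for training data
live_feature = np.random.normal(0.5, 1.0, 1000)    # stand-in for new observations

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.05:                                  # threshold is a judgment call
    print("Distribution shift detected - consider retraining the model")
```
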
Thanks for reading to the end, and I hope you liked it!
Previously published at https://medium.com/@siddhesh_jadhav/how-to-deal-with-major-challenges-in-machine-learning-1fc7e719bd0b
