Machine Learning - Model Evaluation
Bootcamp AI — Session 3
Author: Miguel Calle | Slides: Roberto Sanchez
In this article, we explain the basic definitions of the main kinds of machine learning algorithms and give some tips on how to use them.
There are several models to choose from in machine learning:
- Logistic Regression
- Naive Bayes Classifiers
- Support Vector Machines (SVMs)
- Decision Trees
- Random Forest
- Kernel Methods
- Genetic Algorithms
- Neural Networks
Now let’s explore the most popular algorithms used for data analysis in machine learning.
Decision Tree
The idea of this algorithm is to choose, at each step, the best split among all features and all possible split points. Models created with this algorithm have the structure of a tree, which is often compared to a flowchart.
Decision Tree
max_depth = 2
….
Maximum depth refers to the length of the longest path from the root to a leaf.
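To make the "best split among all features and all possible split points" idea concrete, here is a minimal sketch in plain Python. It assumes Gini impurity as the measure of split quality (the text above does not name a criterion), and the toy data and the names `gini` and `best_split` are ours, for illustration only.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Try every feature and every observed value as a threshold.

    Returns (feature_index, threshold, weighted_gini) for the split
    `row[feature_index] <= threshold` with the lowest weighted impurity.
    """
    best = None
    n_features = len(rows[0])
    for feature in range(n_features):
        for threshold in sorted({row[feature] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            if not left or not right:
                continue  # a split must send samples to both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best


# Hypothetical toy data: two features per row, two classes.
rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 6.0], [7.0, 2.0]]
labels = ["a", "a", "b", "b"]
split = best_split(rows, labels)  # feature 0 separates the classes perfectly
```

A full tree would apply `best_split` again to each side of the split, and max_depth limits how many times that can happen along any path.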
Logistic Regression
We use this algorithm on data sets where the dependent variable (target) is categorical. A clear example of logistic regression is deciding whether an email is spam or not.
Logistic Regression
solver = “liblinear”
multi_class = “ovr” (binary)
…
Logistic regression can also be used to solve multi-class classification problems. In the multi-class case with multi_class = “ovr”, the training algorithm uses the one-vs-rest (OvR) scheme.
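As a rough illustration of the spam example and of the one-vs-rest idea, the sketch below scores a sample with the logistic (sigmoid) function and, for several classes, picks the class whose binary model is most confident. The weights, feature meanings, and class names are hypothetical and hand-picked, not learned from data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    """Probability of the positive class for one sample."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_ovr(models, features):
    """One-vs-rest: one binary model per class, pick the most confident one.

    `models` maps a class label to its (weights, bias) pair.
    """
    scores = {label: predict_proba(w, b, features) for label, (w, b) in models.items()}
    return max(scores, key=scores.get)


# Hypothetical features: [number of links, number of exclamation marks]
spam_weights, spam_bias = [0.8, 0.5], -3.0
p_spam = predict_proba(spam_weights, spam_bias, [4, 2])  # z = 3.2 + 1.0 - 3.0 = 1.2
is_spam = p_spam >= 0.5

# Hypothetical three-class problem: one (weights, bias) pair per class.
models = {
    "spam": ([0.8, 0.5], -3.0),
    "newsletter": ([0.3, -0.5], -0.5),
    "personal": ([-0.9, -0.2], 1.0),
}
label = predict_ovr(models, [0, 0])
```

In practice the weights come out of training; only the scoring and the "pick the highest probability" step are shown here.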
SVM (Gaussian kernel)
We can use a Support Vector Machine (SVM) for classification and regression problems. With this algorithm we can solve both linear and non-linear problems. The idea of SVM is that the algorithm creates a line or a hyperplane that separates the data into classes.
SVM
kernel = “poly”
C = 1 (penalty parameter for the error term)
…
Thus SVM tries to place the decision boundary so that the separation between the two classes (the “street”) is as wide as possible. Depending on the type of data, and on whether the data is linearly separable or not, we can choose the kernel option “poly” or “linear”.
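The sketch below shows, under simplifying assumptions, what the two kernel options compute and how wide the "street" is for a linear boundary. It assumes the standard formulation in which the margin width equals 2 / ||w||. The hyperplane `w`, `b` is chosen by hand for illustration, not found by training, and the default kernel parameters are our own choice.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """Plain dot product: gives a straight-line (hyperplane) boundary."""
    return dot(u, v)


def poly_kernel(u, v, degree=3, gamma=1.0, coef0=0.0):
    """Polynomial kernel: (gamma * <u, v> + coef0) ** degree."""
    return (gamma * dot(u, v) + coef0) ** degree


def decision_value(w, b, x):
    """Signed score of a linear SVM; its sign gives the predicted class."""
    return dot(w, x) + b


def margin_width(w):
    """Width of the 'street' between the two margin lines: 2 / ||w||."""
    return 2.0 / math.sqrt(dot(w, w))


# Hypothetical separating hyperplane, chosen by hand for illustration.
w, b = [3.0, 4.0], -5.0
side = 1 if decision_value(w, b, [2.0, 1.0]) >= 0 else -1  # 6 + 4 - 5 = 5
width = margin_width(w)  # ||w|| = 5
```

A smaller ||w|| means a wider street, which is why training an SVM pushes the weights down while C controls how heavily misclassified points are penalized.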
Neural Network
A neural network is a computational model designed to imitate the functioning of a biological neural network, with the goal of learning and solving problems.
Neural Network
hidden_layer = 2
activation = “identity”
…
A hidden layer is a layer between the input and the output; the hidden_layer setting refers to these layers. The activation function is responsible for returning an output from an input value, usually mapping the output into a given range such as (0, 1) or (-1, 1). The “identity” activation is the exception: it returns its input unchanged.
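To see how a hidden layer and an activation function fit together, here is a small forward pass written in plain Python. The network shape (two inputs, one hidden layer of two neurons, one output) and all weights are hypothetical values picked for illustration.

```python
import math


def identity(z):
    return z  # output is unbounded


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))  # output in (0, 1)


def tanh(z):
    return math.tanh(z)  # output in (-1, 1)


def layer(inputs, weights, biases, activation):
    """One fully connected layer: activation(w . inputs + b) for each neuron."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


def forward(inputs, hidden_w, hidden_b, out_w, out_b, activation):
    """Input -> hidden layer (with `activation`) -> linear output layer."""
    hidden = layer(inputs, hidden_w, hidden_b, activation)
    return layer(hidden, out_w, out_b, identity)


# Hypothetical weights: two inputs, two hidden neurons, one output.
hidden_w = [[0.5, -0.5], [1.0, 1.0]]
hidden_b = [0.0, -1.0]
out_w = [[1.0, 1.0]]
out_b = [0.0]

out_identity = forward([1.0, 2.0], hidden_w, hidden_b, out_w, out_b, identity)
out_tanh = forward([1.0, 2.0], hidden_w, hidden_b, out_w, out_b, tanh)
```

With `identity` the whole network stays a linear function of its inputs; swapping in `logistic` or `tanh` squashes each hidden value into a bounded range and lets the network model non-linear relationships.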
Random Forest
A random forest is made of many decision trees.
Random Forest
n_estimators = 10 (number of trees)
max_depth = 2
…
n_estimators is the number of trees used in the forest. max_features, on the other hand, determines the maximum number of features to consider when looking for a split.
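The following sketch shows the ingredients that turn many decision trees into a forest: each tree sees a bootstrap sample of the data, each split looks at a random subset of features, and the trees vote on the final answer. It is a simplified illustration, not a full implementation: the "trees" are hand-written one-rule functions, and the helper names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(rows, labels, rng):
    """Draw len(rows) samples with replacement, one such sample per tree."""
    indices = [rng.randrange(len(rows)) for _ in rows]
    return [rows[i] for i in indices], [labels[i] for i in indices]


def candidate_features(n_features, max_features, rng):
    """Random subset of feature indices to consider at one split."""
    return sorted(rng.sample(range(n_features), max_features))


def forest_predict(trees, row):
    """Each tree votes; the most common class wins."""
    votes = Counter(tree(row) for tree in trees)
    return votes.most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [[2.0, 7.0], [3.0, 1.0], [6.0, 6.0], [7.0, 2.0]]
labels = ["a", "a", "b", "b"]
sample_rows, sample_labels = bootstrap_sample(rows, labels, rng)
features = candidate_features(n_features=2, max_features=1, rng=rng)

# Hypothetical stand-ins for three trained trees (so n_estimators would be 3).
trees = [
    lambda r: "b" if r[0] > 4 else "a",
    lambda r: "b" if r[1] > 5 else "a",
    lambda r: "b" if r[0] > 5 else "a",
]
prediction = forest_predict(trees, [6.0, 2.0])  # votes: b, a, b
```

Because every tree is trained on different rows and features, their individual mistakes tend to differ, and the vote smooths them out.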
Hyperparameter
In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.
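The difference is easy to see in a tiny training loop. In this sketch, which assumes a one-variable linear model fitted by gradient descent, `learning_rate` and `n_epochs` are hyperparameters fixed before training, while `w` and `b` are parameters derived from the data. The data and the chosen values are illustrative.

```python
def train(xs, ys, learning_rate, n_epochs):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0  # parameters: learned from the data
    n = len(xs)
    for _ in range(n_epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hyperparameters: chosen before training starts (values are illustrative).
hyperparams = {"learning_rate": 0.05, "n_epochs": 2000}

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1
w, b = train(xs, ys, **hyperparams)  # w approaches 2, b approaches 1
```

Settings such as max_depth, C, or n_estimators play the same role as `learning_rate` here: we pick them, and the algorithm works out the rest.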
To evaluate a model we have metrics such as accuracy, the confusion matrix, precision, recall, ROC, etc.
Why do we evaluate a model’s performance?
- To find the best performing models
- Part of parameter tuning
- To report/publish results
- To make research/business decisions
- Can the model be used as is?
- Do we need to try to improve?
Performance Metrics
SUPERVISED
— Classification:
— — Accuracy
— — Precision & Recall
— — ROC curves & AUC
— Regression:
— — Mean square error (MSE) + Root MSE
— — Percent error
— — Mean absolute percent error
— Ranking (ordinal/discrete regression):
— — Precision at N
UNSUPERVISED
— — There is no clear measure. It depends on the problem.
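The regression and ranking metrics listed above are short formulas, so a plain Python sketch may help. The function names and sample values are ours, and `mape` assumes that no true value is zero.

```python
import math


def mse(y_true, y_pred):
    """Mean square error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percent error; assumes no true value is zero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_at_n(ranked_items, relevant, n):
    """Fraction of the top-n ranked items that are relevant."""
    top = ranked_items[:n]
    return sum(1 for item in top if item in relevant) / n


# Hypothetical predictions for three samples.
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
error_mse = mse(y_true, y_pred)    # (100 + 400 + 0) / 3
error_rmse = rmse(y_true, y_pred)
error_mape = mape(y_true, y_pred)  # (10% + 10% + 0%) / 3

# Hypothetical ranking: two of the top three results are relevant.
p_at_3 = precision_at_n(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}, 3)
```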
Classification
When we use a classification algorithm, the goal is to predict the categorical class labels of new data based on past observations.
In every project we need to measure the effectiveness of our model: the more effective the model, the better its performance. The confusion matrix is a performance measurement for machine learning classification.
- Used in classification
- Shows actual vs. predicted results
- Enables visualizing performance and calculating performance metrics
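For a binary problem, the confusion matrix boils down to four counts. Here is a minimal sketch that builds them from actual and predicted labels. The labels are made up, and we assume that 1 marks the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts


# Hypothetical actual vs. predicted labels for eight samples.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)

# Laid out as a 2x2 table: rows are actual classes, columns are predictions.
matrix = [
    [counts["tp"], counts["fn"]],  # actual positive
    [counts["fp"], counts["tn"]],  # actual negative
]
```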
Classification metrics
Precision (a.k.a. PPV): What percent of our positive predictions are correct?
Recall (a.k.a. sensitivity): What percent of the actual positive cases did we capture?
F1 score: A single number that combines the two values above. Good for ranking/sorting, and imbalanced classes
Accuracy: What percent of all our predictions (positive and negative) are correct?
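These four metrics can all be computed from the counts of true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN). The sketch below uses made-up counts for an imbalanced problem (12 positives out of 100 samples) to show why accuracy alone can look better than the model really is.

```python
def precision(tp, fp):
    """Share of positive predictions that are correct."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r)


def accuracy(tp, fp, fn, tn):
    """Share of all predictions, positive and negative, that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts: 12 actual positives among 100 samples.
tp, fp, fn, tn = 8, 2, 4, 86
scores = {
    "precision": precision(tp, fp),       # 8 / 10
    "recall": recall(tp, fn),             # 8 / 12
    "f1": f1_score(tp, fp, fn),           # 16 / 22
    "accuracy": accuracy(tp, fp, fn, tn), # 94 / 100
}
```

Here accuracy is 94% even though a third of the positive cases are missed, which the lower recall and F1 score make visible.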
Classification: ROC Curve
Area under curve (the ROC curve)
A Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) vs. the false positive rate (FPR). The maximum area under the curve (AUC) is 1. Completely random predictions have an AUC of 0.5. The advantage of this metric is that it is continuous.
Constructing a ROC Curve
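One common way to construct the curve is to sort the samples by predicted score and lower the decision threshold one step at a time, recording the FPR and TPR at each step. The sketch below does this in plain Python and estimates the AUC with the trapezoidal rule. The scores and labels are hypothetical, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Sweep the threshold from high to low, recording (FPR, TPR) pairs."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only once all samples tied at this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = positive) and model scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
points = roc_points(y_true, scores)
area = auc(points)  # 8 of the 9 positive/negative pairs are ranked correctly
```

The AUC can also be read as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one, which matches the 8/9 obtained here.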
Evaluating a Classifier: What Affects the Performance?
- Complexity of the task
- Large amounts of features (high dimensionality)
- Feature(s) that appear very few times (sparse data)
- Few instances for a complex classification task
- Missing feature values for instances
- Errors in attribute values for instances
- Errors in the labels of training instances
- Uneven availability of instances in classes
- Overfitting
Overfitting
A model overfits the training data when it is very accurate on that data but may not do as well on new test data (see model 2).
What if there is no best model?
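A small experiment can make overfitting visible. In the sketch below, a "memorizer" (a 1-nearest-neighbour lookup) reproduces every training label, including two noisy ones, while a simple threshold rule ignores the noise. The data, the noise, and both models are hypothetical and built by hand for illustration.

```python
def accuracy(model, xs, ys):
    """Share of samples for which the model predicts the right label."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(ys)


def make_memorizer(train_x, train_y):
    """1-nearest-neighbour lookup: reproduces every training label exactly."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


def simple_rule(x):
    """A much simpler model: one threshold."""
    return 1 if x >= 5 else 0


# Underlying pattern: label is 1 when x >= 5. Two training labels are noisy.
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 1, 0, 1, 0, 1, 1]  # noise at x = 3 and x = 7
test_x = [2.6, 3.4, 4.5, 5.5, 6.6, 7.4]
test_y = [0, 0, 0, 1, 1, 1]

memorizer = make_memorizer(train_x, train_y)
results = {
    "memorizer": (accuracy(memorizer, train_x, train_y),
                  accuracy(memorizer, test_x, test_y)),
    "simple_rule": (accuracy(simple_rule, train_x, train_y),
                    accuracy(simple_rule, test_x, test_y)),
}
```

The memorizer is perfect on the training data but does worse than the simple rule on the test data, which is the pattern that signals overfitting.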
Approach: Ensembles
- An ensemble method uses several algorithms that do the same task, and combines their results
- This is called “ensemble learning”. A combination function joins the results:
- Majority vote: each algorithm gets a vote
- Weighted voting: each algorithm’s vote has a weight
- Other complex combination functions
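The two simplest combination functions can be sketched in a few lines. The predictions and weights below are hypothetical; in a real ensemble they would come from trained models and, for example, from each model's validation score.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Each algorithm gets one vote; the most common label wins."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Each algorithm's vote counts in proportion to its weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical predictions from three algorithms for the same email.
predictions = ["spam", "ham", "ham"]
by_majority = majority_vote(predictions)                # two votes for "ham"
by_weight = weighted_vote(predictions, [0.6, 0.2, 0.2])  # 0.6 vs 0.4
```

The two functions can disagree, as they do here: a single trusted algorithm can outweigh a majority of less trusted ones.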
Reference