Terms in Machine Learning
By Ms. Pankaja Alappanavar
Machine learning explores algorithms that learn from data and build models from it. These models can then be used for tasks such as prediction and decision making.
In his book “Machine Learning”, Tom M. Mitchell defined machine learning as follows: a computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
For example, consider fraud detection: the history of credit card transactions is the experience, the task is to identify whether a given transaction is fraudulent, and the performance is the percentage of fraudulent transactions identified correctly.
Another common example is spam filtering, where mails previously labelled as spam or not spam form the experience. When a new mail arrives, the task is to decide whether it is spam, and the performance is measured as the proportion of mails correctly classified as spam or ham.
Below are some of the terms that one commonly comes across when working with machine learning.
• Learning Model-
A learning model consists of a learner and a reasoner. The learner takes past data, i.e. experience, along with some background knowledge and builds a model. The reasoner then uses this model to provide a solution to the task. The model is trained to recognize certain kinds of patterns: it is fed training data, the algorithm fits the model to that data, and the result can then be evaluated on testing data.
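As a rough illustration, the sketch below uses scikit-learn (an assumed library choice) and made-up height/weight values to show the learner building a model from past data and the reasoner applying it to a new instance.

```python
# A minimal sketch of the learner/reasoner split (scikit-learn is an assumed choice;
# the data values are illustrative, not from the article).
from sklearn.tree import DecisionTreeClassifier

# Past experience: toy training data.
X_train = [[5.2, 55], [6.1, 80], [5.4, 60], [6.0, 75]]   # height (ft), weight (kg)
y_train = ["Female", "Male", "Female", "Male"]           # known labels

learner = DecisionTreeClassifier()
model = learner.fit(X_train, y_train)   # the learner builds a model from experience

# The reasoner applies the model to a new, unseen instance.
print(model.predict([[5.9, 78]]))
```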
• Task-
A machine learning task is the type of prediction or inference being made, based on the problem and the available data. For example, if the task is to assign data into classes, it is a classification task; if it is to predict a continuous value, it is a regression task.
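The task type in turn decides what kind of model is appropriate; the short sketch below, assuming scikit-learn and made-up values, simply names one estimator for each task type.

```python
# The task determines the kind of model (a sketch, assuming scikit-learn).
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X = [[40], [60], [80]]   # speed of a vehicle (illustrative values)

# Classification task: assign each instance to a class.
clf = DecisionTreeClassifier().fit(X, ["slow", "medium", "fast"])

# Regression task: predict a continuous value, e.g. distance covered.
reg = DecisionTreeRegressor().fit(X, [40.0, 60.0, 80.0])
```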
• Features/Attributes-
Features are used to describe an item. For example, if we use the speed of a vehicle to predict the distance it covers, then speed is the feature or attribute. If we use the number of petals, the colour of the flower and the width of the petals to describe the type of a flower, then those three are the features or attributes.
• Independent and Dependent Variable-
Consider a small dataset (sketched below) in which the gender of a person is predicted (classification) from the height and weight features. Here height and weight are known as independent variables, whereas gender, which depends on height and weight, is known as the dependent or target variable. A target variable can be a class in a classification task, a group in a clustering task, or a predicted value in regression.
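Since the original table is not reproduced here, the sketch below builds an illustrative version of it with pandas (an assumed library choice; the values are made up for illustration only).

```python
# An illustrative gender-classification dataset (made-up values; pandas is an assumed choice).
import pandas as pd

data = pd.DataFrame({
    "height": [5.3, 6.2, 5.5, 6.1, 5.8],   # independent variable
    "weight": [58, 82, 61, 75, 65],        # independent variable
    "gender": ["Female", "Male", "Female", "Male", "Female"],  # dependent/target variable
})

X = data[["height", "weight"]]   # features / independent variables
y = data["gender"]               # target / dependent variable
```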
• Supervised and Unsupervised learning-
Supervised learning is when labels are associated with the examples. In the dataset above, height and weight are the features and gender is the target variable, so it is a supervised learning problem: the algorithm can learn a rule such as "if the height of a person is >= 6 and the weight is >= 70, then the instance (a single row/example) is Male".
In unsupervised learning, the data is not labelled. In the above example, if the gender column were not present, the problem would be unsupervised, and the algorithm would instead group the instances into categories (two or more) based on their similarities.
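The sketch below, assuming scikit-learn and made-up height/weight values, contrasts the two settings: a classifier that learns from the labels, and a clustering algorithm that groups the same instances without them.

```python
# Supervised vs. unsupervised learning (a sketch; scikit-learn is an assumed choice).
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

X = [[5.3, 58], [6.2, 82], [5.5, 61], [6.1, 75], [5.8, 65]]   # height, weight
y = ["Female", "Male", "Female", "Male", "Female"]            # labels (gender)

# Supervised: the labels are available and guide the learning.
clf = DecisionTreeClassifier().fit(X, y)
print(clf.predict([[6.0, 72]]))      # predicts a class label

# Unsupervised: ignore the labels and let the algorithm group similar instances.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                    # a cluster number for each instance
```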
• Semi-Supervised Learning-
In semi-supervised learning the number of labelled examples is far smaller than the number of unlabelled examples, hence the term semi-supervised.
In simple terms: when a teacher teaches students addition by solving a few examples first and then asks them to solve similar ones, that is supervised learning. In unsupervised learning the teacher expects the students to solve the sums without any help, and in semi-supervised learning the teacher does teach, but with only very few examples.
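As a rough sketch, scikit-learn's SelfTrainingClassifier (one possible choice of semi-supervised algorithm, assumed here) can be given a mostly unlabelled dataset, with -1 marking the unlabelled examples.

```python
# A minimal semi-supervised sketch (scikit-learn is an assumed choice;
# unlabelled examples are marked with -1).
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.tree import DecisionTreeClassifier

X = [[5.3, 58], [6.2, 82], [5.5, 61], [6.1, 75], [5.8, 65], [6.3, 90]]
y = [0, 1, -1, -1, -1, -1]    # only the first two instances are labelled

semi = SelfTrainingClassifier(DecisionTreeClassifier()).fit(X, y)
print(semi.predict([[6.0, 78]]))   # learned from few labelled + many unlabelled points
```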
• Reinforcement Learning-
In certain applications the output of the system is a sequence of actions. In such cases a single action is not important by itself; what matters is the policy, i.e. the sequence of actions that reaches the goal. An example is a self-learning vacuum cleaner, where every correct action (such as avoiding an obstacle) is rewarded and every incorrect action is penalised.
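The toy loop below (a hypothetical vacuum-cleaner world, not a real reinforcement learning library) only illustrates the reward/penalty signal that a reinforcement learner would use to improve its policy.

```python
# A toy sketch of rewards and penalties in a hypothetical vacuum-cleaner world.
import random

total_reward = 0
for step in range(10):
    obstacle_ahead = random.random() < 0.3           # state of the environment
    action = random.choice(["forward", "turn"])      # an untrained agent acts randomly
    if obstacle_ahead and action == "forward":
        reward = -1        # penalty: bumped into an obstacle
    else:
        reward = +1        # reward: a correct action
    total_reward += reward  # a learner would use these rewards to improve its policy

print("total reward after 10 steps:", total_reward)
```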
• Dataset-
A dataset consists of data in a tabular format, where each column is a feature and each row is an instance or record. The gender-classification data above can be considered a dataset with two features (height and weight) and five instances or records. Datasets generally contain a huge number of instances to train the model.
• Training and Testing Dataset-
What is generally desired is a generalized learning model, i.e. one that performs well even on unseen data: examples it was not trained on. For this, the dataset is divided into two parts, a training dataset and a testing dataset.
The training dataset is used to train the model; it is the data the model sees and learns from, and its observations form the experience that the algorithm uses to learn.
The test dataset is used to estimate the true error: its observations are used to evaluate the performance of the model on data it has not seen. The training and testing datasets should therefore not overlap.
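A minimal sketch of such a split, assuming scikit-learn and its built-in iris dataset, might look like this:

```python
# Splitting a dataset into training and testing parts (a sketch, assuming scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hold out 30% of the instances as unseen test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learn only from the training part
print("test accuracy:", model.score(X_test, y_test))     # evaluate on data the model never saw
```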
• Validation dataset-
The training dataset can be further divided into a training part and a validation part. The validation part is used to check the generalization of the model, i.e. to ensure that the model has not simply memorized the training data and that it also performs well on unseen data. The validation dataset is also used to tune variables called hyperparameters, which control how the model learns.
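A rough sketch of tuning one hyperparameter (the tree depth) on a validation split, again assuming scikit-learn and its iris dataset, could look like this:

```python
# Using a validation split to tune a hyperparameter (a sketch, assuming scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# Carve a validation set out of the training data.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

best_depth, best_score = None, 0.0
for depth in [1, 2, 3, 5]:                      # max_depth is the hyperparameter being tuned
    model = DecisionTreeClassifier(max_depth=depth).fit(X_tr, y_tr)
    score = model.score(X_val, y_val)           # validation accuracy guides the choice
    if score > best_score:
        best_depth, best_score = depth, score

print("best max_depth on the validation set:", best_depth)
```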
• Overfitting-
When the model memorizes the training dataset, it is said to overfit. A model that memorizes the observations, including their noise, will not perform well on unseen data.
• Underfitting-
When the model cannot even classify the data it was trained on correctly, it is said to underfit.
There should be a balance between memorization and generalization, i.e. between overfitting and underfitting.
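The sketch below, assuming scikit-learn and its iris dataset, hints at this balance by comparing training and test accuracy for a very rigid tree and a fully grown one.

```python
# Contrasting underfitting and overfitting via train vs. test accuracy
# (a sketch, assuming scikit-learn; max_depth controls model flexibility).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for depth in [1, None]:   # depth 1: very rigid (tends to underfit); None: fully grown (can overfit)
    model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
    print(f"max_depth={depth}: train={model.score(X_train, y_train):.2f}, "
          f"test={model.score(X_test, y_test):.2f}")
```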
• Bias-
Bias is the difference between the actual value and the value predicted by the model; more precisely, it is the systematic gap between the model's average prediction and the true value.
• Variance-
Variance tells us how scattered the model's predictions are: a high-variance model's predictions change a lot when it is trained on different samples of the data.
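The rough numerical sketch below (assuming numpy; the linear model and data are made up for illustration) estimates both quantities by retraining a simple model on many resampled training sets and looking at its predictions at a single test point.

```python
# A rough numerical illustration of bias and variance (numpy is an assumed choice).
import numpy as np

rng = np.random.default_rng(0)
true_value = 2.0 * 5.0                       # the target follows y = 2x; we predict at x = 5

predictions = []
for _ in range(200):
    x = rng.uniform(0, 10, size=20)          # a fresh noisy training sample
    y = 2.0 * x + rng.normal(0, 1, size=20)
    slope = (x @ y) / (x @ x)                # fit y = slope * x by least squares
    predictions.append(slope * 5.0)          # predict at x = 5

predictions = np.array(predictions)
bias = predictions.mean() - true_value       # systematic offset of the average prediction
variance = predictions.var()                 # spread of predictions across training sets
print(f"bias ~ {bias:.3f}, variance ~ {variance:.3f}")
```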