Sunday, 23 January 2022

Normalization data pre-processing technique in machine learning

Normalization is the process of scaling individual samples to have unit norm. This process can be useful if you plan to use a quadratic form such as the dot-product or any other kernel to quantify the similarity of any pair of samples.

This assumption is the basis of the Vector Space Model often used in text classification and clustering contexts.

Data normalization is used when you want to adjust the values in the feature vector so that they can be measured on a common scale. One of the most common forms of normalization used in machine learning adjusts the values of a feature vector so that their absolute values sum up to 1 (the l1 norm), as illustrated in the sketch below.
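As a quick illustration of this idea, here is a minimal sketch with a made-up vector: dividing a vector by the sum of the absolute values of its components yields components whose absolute values sum to 1.

import numpy as np

v = np.array([3.0, -1.5, 2.0, -5.4])   # made-up sample vector
v_l1 = v / np.abs(v).sum()             # l1 normalization done by hand
print(np.abs(v_l1).sum())              # ≈ 1.0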

Types of normalization

To normalize data, the preprocessing.normalize() function can be used. This function scales input vectors individually to a unit norm (vector length) and provides a quick and easy way to perform this operation on a single array-like dataset. Three types of norms are supported: l1, l2, and max.

How it works

1. As we said, to normalize data, the preprocessing.normalize() function can be used as follows:

from sklearn import preprocessing
import numpy as np

data = np.array([[3, -1.5, 2, -5.4], [0, 4, -0.3, 2.1], [1, 3.3, -1.9, -4.3]])
# l1-normalize each column (axis=0): the absolute values in each column sum to 1
data_normalized = preprocessing.normalize(data, norm='l1', axis=0)

2. To display the normalized array, we will use the following code:

print(data_normalized)

The following output is returned:

[[ 0.75       -0.17045455  0.47619048 -0.45762712]
 [ 0.          0.45454545 -0.07142857  0.1779661 ]
 [ 0.25        0.375      -0.45238095 -0.36440678]]

 

3. As already mentioned, the sum of the absolute values along each column (feature) of the normalized array must be equal to 1. Let's check this for each column:

data_norm_abs = np.abs(data_normalized)
print(data_norm_abs.sum(axis=0))

In the first line of code, we used the np.abs() function to evaluate the absolute value of each element in the array. In the second line of code, we used the sum() function to calculate the sum of each column (axis=0). The following results are returned:

[1. 1. 1. 1.]
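The l2 and max norms mentioned earlier work the same way; here is a brief sketch reusing the data array and imports from the steps above (the comments describe the property each norm enforces):

# l2 norm: scale each column so that its squared values sum to 1
data_l2 = preprocessing.normalize(data, norm='l2', axis=0)
print((data_l2 ** 2).sum(axis=0))      # ≈ [1. 1. 1. 1.]

# max norm: divide each column by its largest absolute value
data_max = preprocessing.normalize(data, norm='max', axis=0)
print(np.abs(data_max).max(axis=0))    # [1. 1. 1. 1.]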


Min and Max scaling on sample data in machine learning

An alternative standardization is scaling features to lie between a given minimum and maximum value, often between zero and one, or so that the maximum absolute value of each feature is scaled to unit size. This can be achieved using MinMaxScaler or MaxAbsScaler, respectively.

The motivation for using this scaling includes robustness to very small standard deviations of features and preservation of zero entries in sparse data.

The values of each feature in a dataset can vary over very different ranges. So, it is sometimes important to scale them so that comparisons take place on a level playing field. Through this statistical procedure, it's possible to compare identical variables belonging to different distributions, as well as different variables.

 Let's see how to scale data in Python:

 1.  Let's start by defining the data_scaler variable:
    data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))

2. Now we will use the fit_transform() method, which fits the data and then transforms it (we will use the same data as in the previous recipe):

data_scaled = data_scaler.fit_transform(data)

A NumPy array of a specific shape is returned. To understand how this function has transformed data, we display the minimum and maximum of each column in the array.

3. First, for the starting data and then for the processed data:

print("Min: ",data.min(axis=0))
print("Max: ",data.max(axis=0))

The following results are returned:

Min:  [ 0.  -1.5 -1.9 -5.4]
Max:  [3.  4.  2.  2.1]

4. Now, let's do the same for the scaled data using the following code:

print("Min: ",data_scaled.min(axis=0))
print("Max: ",data_scaled.max(axis=0))

The following results are returned:

Min:  [0. 0. 0. 0.]
Max:  [1. 1. 1. 1.]

 After scaling, all the feature values range between the specified values. 

To display the scaled array, we use print(data_scaled), which returns the following:

[[1.         0.         1.         0.        ]
 [0.         1.         0.41025641 1.        ]
 [0.33333333 0.87272727 0.         0.14666667]]

Here is the complete example, pulling together the steps above.
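(A minimal end-to-end sketch assembled from the snippets above, using the same sample data array.)

from sklearn import preprocessing
import numpy as np

# Same sample data as in the normalization recipe
data = np.array([[3, -1.5, 2, -5.4], [0, 4, -0.3, 2.1], [1, 3.3, -1.9, -4.3]])

# Scale every feature to the range [0, 1]
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_scaled = data_scaler.fit_transform(data)

print("Min: ", data_scaled.min(axis=0))   # [0. 0. 0. 0.]
print("Max: ", data_scaled.max(axis=0))   # [1. 1. 1. 1.]
print(data_scaled)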

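MaxAbsScaler, mentioned at the beginning of this section, works in a similar way but divides each feature by its maximum absolute value, mapping the data into the range [-1, 1] and preserving zero entries. A brief sketch, reusing the data array from the complete example above:

data_maxabs = preprocessing.MaxAbsScaler().fit_transform(data)
print(np.abs(data_maxabs).max(axis=0))   # [1. 1. 1. 1.] -- each feature's largest absolute value is now 1
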
Mean Removal data pre-processing technique on sample data in machine learning

Mean removal in machine learning is a data pre-processing technique that removes the mean from every feature so that each feature is centered on zero. It also helps remove bias from the features.

In the real world, we usually have to deal with a lot of raw data. This raw data is not readily ingestible by machine learning algorithms. To prepare data for machine learning, we have to preprocess it before we feed it into various algorithms. This is an intensive process that takes plenty of time, almost 80 percent of the entire data analysis process in some scenarios. However, it is vital for the rest of the data analysis workflow, so it is necessary to learn the best practices of these techniques.

Before sending our data to any machine learning algorithm, we need to cross-check the quality and accuracy of the data. If we are unable to read the data into Python correctly, or if we can't go from raw data to something that can be analyzed, we cannot go ahead. Data can be preprocessed in many ways: standardization, scaling, normalization, binarization, and one-hot encoding are some examples of preprocessing techniques.

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data: Gaussian with zero mean and unit variance.

In practice we often ignore the shape of the distribution and just transform the data to center it by removing the mean value of each feature, then scale it by dividing non-constant features by their standard deviation.

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the l1 and l2 regularizers of linear models) assume that all features are centered around zero and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

Example

from sklearn import preprocessing
import numpy as np

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])

# Fit the scaler so that it learns the per-feature mean and standard deviation
scaler = preprocessing.StandardScaler().fit(X_train)
scaler
# StandardScaler()

scaler.mean_
# array([1., 0., 0.33333333])

scaler.scale_
# array([0.81649658, 0.81649658, 1.24721913])

# Center each feature on zero and scale it to unit variance
X_scaled = scaler.transform(X_train)
X_scaled


Output

array([[ 0.        , -1.22474487,  1.33630621],
       [ 1.22474487,  0.        , -0.26726124],
       [-1.22474487,  1.22474487, -1.06904497]])

Scaled data has zero mean and unit variance.
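Using the X_scaled array from the example above, we can check this directly:

X_scaled.mean(axis=0)   # array([0., 0., 0.])
X_scaled.std(axis=0)    # array([1., 1., 1.])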

You can see that the mean is almost 0 and the standard deviation is 1.
