Examples of how to encode a qualitative (categorical) variable as a quantitative variable with scikit-learn in Python
Input matrix
Let's consider the following input matrix X:
from sklearn import preprocessing
import numpy as np
X = np.array(('A','C','B','A','C','D','A'))
of shape
print(X.shape)
(7,)
The scikit-learn encoders expect a 2D array (samples x features), so X is reshaped into a single column:
X = X.reshape(-1,1)
X now has the shape
print(X.shape)
(7, 1)
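As a minimal sketch of why the reshape matters, the snippet below builds the single-column array in two equivalent ways and checks that fitting an encoder on the flat 1D array is rejected. The names X_flat, X_col, X_col_alt and raised are only illustrative and do not appear in the original example.

```python
import numpy as np
from sklearn import preprocessing

# Flat 1D array of labels (illustrative name, not from the original example)
X_flat = np.array(['A', 'C', 'B', 'A', 'C', 'D', 'A'])

# Two equivalent ways to build a single-column 2D array
X_col = X_flat.reshape(-1, 1)
X_col_alt = X_flat[:, np.newaxis]
print(X_col.shape, X_col_alt.shape)

# Fitting directly on the 1D array is expected to raise a ValueError,
# because the encoder wants one row per sample and one column per feature
try:
    preprocessing.OrdinalEncoder().fit(X_flat)
    raised = False
except ValueError:
    raised = True
print(raised)
```

Using -1 in reshape lets NumPy infer the number of rows, so the same line works for any number of samples.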
Encoding the elements of matrix X with OrdinalEncoder
To encode the elements of matrix X, one solution is to use the OrdinalEncoder class, which maps each category to an integer code:
enc = preprocessing.OrdinalEncoder(categories='auto')
enc.fit(X)
print( enc.transform(X) )
returns
[[0.]
[2.]
[1.]
[0.]
[2.]
[3.]
[0.]]
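To see which category each code stands for, a short sketch like the following can help. It reads the fitted categories_ attribute and uses inverse_transform to recover the original labels. The variable names codes and decoded are illustrative only.

```python
import numpy as np
from sklearn import preprocessing

X = np.array(['A', 'C', 'B', 'A', 'C', 'D', 'A']).reshape(-1, 1)

enc = preprocessing.OrdinalEncoder(categories='auto')
codes = enc.fit_transform(X)  # fit and transform in one step

# categories_ holds one array of sorted categories per input column;
# the position of a category in that array is its integer code
print(enc.categories_)

# inverse_transform maps the codes back to the original labels
decoded = enc.inverse_transform(codes)
print(decoded.ravel())
```

With categories='auto' the categories are sorted, which is why 'A' is encoded as 0 and 'D' as 3 here. The resulting order is alphabetical and may not reflect any meaningful ranking of the categories.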
Encoding the elements of matrix X with OneHotEncoder
Another solution is to use the OneHotEncoder class, which creates one binary column per category:
enc = preprocessing.OneHotEncoder(categories='auto')
enc.fit(X)
print( enc.transform(X) )
returns a sparse matrix, printed as (row, column) value entries:
(0, 0) 1.0
(1, 2) 1.0
(2, 1) 1.0
(3, 0) 1.0
(4, 2) 1.0
(5, 3) 1.0
(6, 0) 1.0
To get a dense array, call toarray():
print( enc.transform(X).toarray() )
which gives:
[[1. 0. 0. 0.]
[0. 0. 1. 0.]
[0. 1. 0. 0.]
[1. 0. 0. 0.]
[0. 0. 1. 0.]
[0. 0. 0. 1.]
[1. 0. 0. 0.]]
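A common follow-up question is what happens when new data contains a category that was not seen during fitting. The sketch below assumes the handle_unknown='ignore' option is the desired behaviour (by default the encoder raises an error instead). X_new and encoded_new are illustrative names, and 'E' is a made-up unseen category.

```python
import numpy as np
from sklearn import preprocessing

X = np.array(['A', 'C', 'B', 'A', 'C', 'D', 'A']).reshape(-1, 1)

# handle_unknown='ignore' encodes unseen categories as an all-zero row
enc = preprocessing.OneHotEncoder(categories='auto', handle_unknown='ignore')
enc.fit(X)

# The columns of the one-hot matrix follow the order of categories_
print(enc.categories_)

# 'B' was seen during fit; 'E' is a hypothetical unseen category
X_new = np.array([['B'], ['E']])
encoded_new = enc.transform(X_new).toarray()
print(encoded_new)
```

Here the row for 'B' has a single 1 in the second column, while the row for 'E' contains only zeros.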