Normalizing each row of a Pandas DataFrame into percentages is a useful step when analyzing data. Once every row is expressed as shares of its own total, we can compare rows directly and see the relative weight of each value within its row.
Create synthetic data
import pandas as pd
import numpy as np
np.random.seed(42)
data = np.random.random_sample((6, 2)) * 10
df = pd.DataFrame(data, columns=['A', 'B'])
df
Output
A B
0 3.745401 9.507143
1 7.319939 5.986585
2 1.560186 1.559945
3 0.580836 8.661761
4 6.011150 7.080726
5 0.205845 9.699099
So the goal here is to normalize each row of the DataFrame into percentages.
Step 1: Sum each row
To normalize a row, each value is divided by the sum of all the values in that row. For non-negative data this gives a number between 0 and 1, representing each value's share of the row total. So the first step is to compute the sum of each row:
df.sum(axis=1)
Output
0 13.252544
1 13.306524
2 3.120132
3 9.242598
4 13.091876
5 9.904943
dtype: float64
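To make the arithmetic concrete, here is a minimal sketch that works through row 0 by hand, using the values copied from the DataFrame output above. The variable names (`row_total`, `share_a`, `share_b`) are illustrative only and do not come from pandas.

```python
# Row 0 of the example DataFrame, copied from the printed output
a, b = 3.745401, 9.507143

row_total = a + b          # the value df.sum(axis=1) reports for row 0
share_a = a / row_total    # fraction of the row total held by column A
share_b = b / row_total    # fraction of the row total held by column B

print(round(row_total, 6))
print(round(share_a, 6), round(share_b, 6))
```

The two shares add up to 1, which is what makes them usable as percentages once multiplied by 100.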
Step 2: Divide each row by its sum
df[['A','B']].div(df.sum(axis=1), axis=0)
Output
A B
0 0.282618 0.717382
1 0.550102 0.449898
2 0.500039 0.499961
3 0.062843 0.937157
4 0.459151 0.540849
5 0.020782 0.979218
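The `axis=0` argument is what makes this division work row by row: it tells `div` to align the Series of row sums with the DataFrame's row index, so each row is divided by its own total. The sketch below rebuilds the same data and checks that every normalized row sums to 1. The names `row_sums` and `fractions` are illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the same synthetic data as in the article
np.random.seed(42)
df = pd.DataFrame(np.random.random_sample((6, 2)) * 10, columns=['A', 'B'])

row_sums = df.sum(axis=1)             # one total per row, indexed like df
fractions = df.div(row_sums, axis=0)  # axis=0 matches row_sums to the row index

# Each normalized row should add up to 1 (within floating-point tolerance)
print(np.allclose(fractions.sum(axis=1), 1.0))
```

With the default `axis='columns'`, pandas would instead try to match the index of `row_sums` against the column labels `A` and `B`, which is not what we want here.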
Step 3: Multiply by 100
We then multiply by 100 to turn the fractions into percentage values.
df[['A','B']].div(df.sum(axis=1), axis=0) * 100
This returns a new DataFrame in which each value is expressed as a percentage of its row total, so every row sums to 100. The original DataFrame is left unchanged.
Output
A B
0 28.261752 71.738248
1 55.010153 44.989847
2 50.003865 49.996135
3 6.284339 93.715661
4 45.915117 54.084883
5 2.078204 97.921796
Round the normalized DataFrame
(df[['A','B']].div(df.sum(axis=1), axis=0) * 100).round(2)
Output
A B
0 28.26 71.74
1 55.01 44.99
2 50.00 50.00
3 6.28 93.72
4 45.92 54.08
5 2.08 97.92
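The three steps and the rounding can be wrapped into one reusable function. The following is a minimal sketch, not a pandas API: the function name `normalize_rows_to_percent` and its `decimals` parameter are hypothetical, and the handling of all-zero rows (returning NaN rather than dividing by zero) is an assumption you may want to change for your data.

```python
import numpy as np
import pandas as pd


def normalize_rows_to_percent(frame, decimals=None):
    """Return a new DataFrame where each row is rescaled to sum to 100.

    Hypothetical helper for illustration; it is not part of pandas.
    """
    row_sums = frame.sum(axis=1)
    # Assumption: a row whose total is 0 has no meaningful percentages,
    # so its total is replaced by NaN and the whole row becomes NaN.
    row_sums = row_sums.replace(0, np.nan)
    percent = frame.div(row_sums, axis=0) * 100
    if decimals is not None:
        percent = percent.round(decimals)
    return percent


# Small example with one all-zero row
example = pd.DataFrame({'A': [1.0, 0.0, 3.0], 'B': [3.0, 0.0, 1.0]})
result = normalize_rows_to_percent(example, decimals=2)
print(result)
```

Keep in mind that rounded percentages may no longer add up to exactly 100 in every row, so it is usually safer to round only for display and keep the unrounded values for further calculations.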
References
Links | Site |
---|---|
pandas.DataFrame.sum | pandas.pydata.org |
pandas.DataFrame.div | pandas.pydata.org |