Introduction
Saving time on cleaning, tidying and processing data is very useful in data science. It leaves you more time to analyse the data and think about solutions.
When I do data science with Python, I usually use the Pandas and NumPy libraries. They are great libraries with a lot of smart functions. However, sometimes I forget some of those functions and write my own to solve a calculation. That is good practice, but it costs time.
As a reference guide, I will write down some interesting built-in functions from Pandas and NumPy to reinforce my memory and, hopefully, help someone else.
Pandas
Function cumsum
Suppose we have a 10-day series of rainfall data:
>>> import pandas as pd
>>> df = pd.DataFrame({"DAY": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
...                    "RAIN": [6, 0, 0, 20, 30, 40, 0, 0, 10, 15]})
>>> df
   DAY  RAIN
0    1     6
1    2     0
2    3     0
3    4    20
4    5    30
5    6    40
6    7     0
7    8     0
8    9    10
9   10    15
For this data, we need to calculate the cumulative sum.
>>> df['RAIN'].cumsum()
0 6
1 6
2 6
3 26
4 56
5 96
6 96
7 96
8 106
9 121
Name: RAIN, dtype: int64
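As a quick sanity check, the sketch below compares a hand-written running total with cumsum on the same rainfall values. The variable names (rain, running_total, builtin_total) are mine, chosen only for illustration.

```python
import pandas as pd

# Same rainfall values as in the example above.
rain = pd.Series([6, 0, 0, 20, 30, 40, 0, 0, 10, 15], name="RAIN")

# Hand-written running total, for comparison only.
running_total = []
total = 0
for value in rain:
    total += value
    running_total.append(total)

# The built-in method produces the same numbers in one call.
builtin_total = rain.cumsum()

print(running_total)
print(builtin_total.tolist())
```

Both lists should match, which is the point: the loop is fine for practice, but the built-in saves the typing.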
Function diff
Now we want to calculate the difference in rainfall from one day to the next.
>>> df['RAIN'].diff()
0 NaN
1 -6.0
2 0.0
3 20.0
4 10.0
5 10.0
6 -40.0
7 0.0
8 10.0
9 5.0
Name: RAIN, dtype: float64
Note the NaN value in the first row: the first day has no previous day to compare with. If you want to remove it, just use df['RAIN'].diff()[1:]
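Two small variations that may be handy, shown as a sketch on the same values: dropna() removes the leading NaN without slicing by position, and the periods parameter of diff compares each day with an earlier day instead of the previous one. The variable names are illustrative only.

```python
import pandas as pd

# Same rainfall values as in the example above.
rain = pd.Series([6, 0, 0, 20, 30, 40, 0, 0, 10, 15], name="RAIN")

# Day-to-day change with the leading NaN dropped.
daily_change = rain.diff().dropna()

# Change compared with two days earlier; the first two rows are NaN.
two_day_change = rain.diff(periods=2)

print(daily_change.tolist())
print(two_day_change.tolist())
```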
Rolling function
A window over the values of a column can be built with a for loop:
>>> for i in range(1, len(df)):
...     cc = df['RAIN'].iloc[i-1:i+1].mean()
...     print(cc)
...
3.0
0.0
10.0
25.0
35.0
20.0
0.0
5.0
12.5
Instead of a loop, you can use the rolling method, which provides rolling window calculations. Each row gets a window of rows ending at that row: with a window size of 2, row 6 uses the values from rows 5 and 6. The first row has no previous row to complete its window, so its result is NaN.
>>> df['RAIN'].rolling(2).mean()
0 NaN
1 3.0
2 0.0
3 10.0
4 25.0
5 35.0
6 20.0
7 0.0
8 5.0
9 12.5
Name: RAIN, dtype: float64
You can replace mean() with sum() or another aggregation function.
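For instance, here is a sketch with a 3-day window: one version using sum(), and one using the min_periods parameter of rolling so that the first rows return a value computed from whatever data is available instead of NaN. The window size and variable names are my own choices for illustration.

```python
import pandas as pd

# Same rainfall values as in the example above.
rain = pd.Series([6, 0, 0, 20, 30, 40, 0, 0, 10, 15], name="RAIN")

# 3-day rolling total; the first two rows are NaN (window not yet full).
rolling_sum = rain.rolling(window=3).sum()

# 3-day rolling mean that accepts partial windows at the start.
partial_mean = rain.rolling(window=3, min_periods=1).mean()

print(rolling_sum.tolist())
print(partial_mean.tolist())
```

Whether partial windows make sense depends on the analysis; for rainfall totals a half-filled window can be misleading, so the default NaN is often the safer choice.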
NumPy
Create an empty array
Suppose you want to create an empty array with 3 rows and 5 columns, filled with NaN.
>>> import numpy as np
>>> arr = np.empty(shape=(3, 5))
>>> arr[:] = np.nan
>>> arr
array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])
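The create-then-fill steps above can also be written as a single call with np.full, which builds an array of a given shape already filled with a value. This is only an alternative sketch; the result should be the same NaN-filled array.

```python
import numpy as np

# Build a 3x5 float array already filled with NaN.
arr = np.full((3, 5), np.nan)

print(arr)
print(arr.dtype)
```

Because NaN is a floating-point value, the array dtype is float64, which is why integers assigned to it later show up as floats.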
Now we will fill in some values:
>>> arr[0] = np.random.randint(0, 20, 5)
>>> arr[1] = np.random.randint(0, 20, 5)
>>> arr[2] = np.random.randint(0, 20, 5)
>>> arr
array([[ 7.,  8.,  5.,  5., 17.],
       [16., 14., 15., 19., 16.],
       [15.,  6.,  9.,  6.,  5.]])
To get the average of each column (axis=0) or of each row (axis=1):
>>> arr.mean(axis=0)
array([12.66666667,  9.33333333,  9.66666667, 10.        , 12.66666667])
>>> arr.mean(axis=1)
array([ 8.4, 16. ,  8.2])
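One thing to watch, since this array started out full of NaN: if any cell is still NaN, mean() returns NaN for that row or column. The sketch below rebuilds the array from the values printed above, blanks one cell on purpose, and uses np.nanmean, which ignores NaN values. The name arr_with_gap is illustrative only.

```python
import numpy as np

# Values copied from the filled array shown above.
arr = np.array([[7., 8., 5., 5., 17.],
                [16., 14., 15., 19., 16.],
                [15., 6., 9., 6., 5.]])

column_means = arr.mean(axis=0)  # one value per column, shape (5,)
row_means = arr.mean(axis=1)     # one value per row, shape (3,)

# Simulate a cell that was never filled.
arr_with_gap = arr.copy()
arr_with_gap[0, 0] = np.nan

plain_row_means = arr_with_gap.mean(axis=1)           # first entry is NaN
nan_safe_row_means = np.nanmean(arr_with_gap, axis=1)  # NaN is skipped

print(plain_row_means)
print(nan_safe_row_means)
```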