Tips for Python - NumPy and Pandas libraries

2020-12-30

python / tips / datascience / pandas / numpy

Introduction

Saving time on cleaning, tidying and processing data is valuable in data science: it leaves you more time to analyse the data and think about solutions.

When I do data science with Python, I usually use the Pandas and NumPy libraries. They are great libraries with a lot of smart functions. However, sometimes I forget that a function exists and write my own to do the calculation. That is good practice, but it costs time.

As a reference guide, I will write down some interesting built-in functions from Pandas and NumPy to reinforce my memory, and maybe help someone else along the way.

Pandas

Function cumsum

Suppose we have a 10-day series of rainfall data:

import pandas as pd

df = pd.DataFrame({"DAY": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "RAIN": [6, 0, 0, 20, 30, 40, 0, 0, 10, 15]})

>>> df
   DAY  RAIN
0    1     6
1    2     0
2    3     0
3    4    20
4    5    30
5    6    40
6    7     0
7    8     0
8    9    10
9   10    15

For this data, we need to calculate the cumulative sum.

>>> df['RAIN'].cumsum()
0      6
1      6
2      6
3     26
4     56
5     96
6     96
7     96
8    106
9    121
Name: RAIN, dtype: int64
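
A common next step is to keep the cumulative sum next to the original data and query it. The snippet below is only a small sketch: the column name RAIN_ACC and the threshold of 50 are assumptions made up for this example.

```python
import pandas as pd

df = pd.DataFrame({"DAY": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   "RAIN": [6, 0, 0, 20, 30, 40, 0, 0, 10, 15]})

# RAIN_ACC is a hypothetical column name chosen for this example
df["RAIN_ACC"] = df["RAIN"].cumsum()

# First day on which the accumulated rain reaches 50 (hypothetical threshold)
first_day = df.loc[df["RAIN_ACC"] >= 50, "DAY"].iloc[0]
print(first_day)  # 5
```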

Function diff

Now we want to calculate the change in rain from one day to the next.

>>> df['RAIN'].diff()
0     NaN
1    -6.0
2     0.0
3    20.0
4    10.0
5    10.0
6   -40.0
7     0.0
8    10.0
9     5.0
Name: RAIN, dtype: float64

Note the NaN value in the first row: the first day has no previous day to compare with. If you want to remove it, just use df['RAIN'].diff()[1:]
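
Two variations that may be handy, shown here as a minimal sketch on the same rain values: dropna() removes the NaN without depending on its position, and the periods argument compares each row with one further back. The variable names are just for illustration.

```python
import pandas as pd

rain = pd.Series([6, 0, 0, 20, 30, 40, 0, 0, 10, 15], name="RAIN")

# Drop the leading NaN without relying on its position
daily_change = rain.diff().dropna()

# Difference against the value two rows earlier (the first two rows are NaN)
two_day_change = rain.diff(periods=2)

print(daily_change.tolist())
print(two_day_change.tolist())
```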

Rolling function

To apply a calculation over a moving window of a column, it is possible to write a for loop:

for i in range(1, len(df)):
    cc = df['RAIN'].iloc[i-1:i+1].mean()
    print(cc)
3.0
0.0
10.0
25.0
35.0
20.0
0.0
5.0
12.5

Instead of a loop, it is possible to use the rolling method, which provides rolling window calculations. For each row, a window of the given size ending at that row is selected, e.g. if the window size is 2, for row 6 the values from rows 5 and 6 are used in the calculation. The first row has no complete window, so its result is NaN.

>>> df['RAIN'].rolling(2).mean()
0     NaN
1     3.0
2     0.0
3    10.0
4    25.0
5    35.0
6    20.0
7     0.0
8     5.0
9    12.5
Name: RAIN, dtype: float64

You can replace mean() with sum() or another aggregation function.
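
As a rough sketch of that idea, here is a 3-day rolling total on the same rain values. The min_periods argument is optional; setting it to 1 lets incomplete windows at the start return a value instead of NaN. The variable names are only for this example.

```python
import pandas as pd

rain = pd.Series([6, 0, 0, 20, 30, 40, 0, 0, 10, 15], name="RAIN")

# 3-day rolling total; the first two rows are NaN because the window is incomplete
total_3d = rain.rolling(3).sum()

# min_periods=1 lets incomplete windows produce a value instead of NaN
total_3d_partial = rain.rolling(3, min_periods=1).sum()

print(total_3d.tolist())
print(total_3d_partial.tolist())
```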

NumPy

Create an empty array

Suppose you want to create an empty array with 3 rows and 5 columns. np.empty leaves the values uninitialized, so we fill the array with NaN right after creating it:

import numpy as np

arr = np.empty(shape=(3, 5))
arr[:] = np.nan
>>> arr
array([[nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan],
       [nan, nan, nan, nan, nan]])
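
If the goal is an array that starts out as NaN, np.full can do the allocation and the fill in one call. This is just an alternative sketch, not a replacement for the approach above:

```python
import numpy as np

# np.full allocates the array and sets every element in one step
arr = np.full((3, 5), np.nan)

print(arr.shape)            # (3, 5)
print(np.isnan(arr).all())  # True
```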

Now we will fill in some values:

>>> arr[0] = np.random.randint(0, 20, 5)
>>> arr[1] = np.random.randint(0, 20, 5)
>>> arr[2] = np.random.randint(0, 20, 5)
>>> arr
array([[ 7.,  8.,  5.,  5., 17.],
       [16., 14., 15., 19., 16.],
       [15.,  6.,  9.,  6.,  5.]])
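
The random values change on every run. If you want the same numbers each time, one option is a seeded generator. A small sketch, where the seed 42 is an arbitrary choice for this example:

```python
import numpy as np

# A seeded generator returns the same values on every run
rng = np.random.default_rng(42)

arr = np.full((3, 5), np.nan)
# Integers from 0 to 19; assigning them into the float array converts them to floats
arr[:] = rng.integers(0, 20, size=(3, 5))

print(arr)
```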

To get the average of each column (axis=0) or of each row (axis=1):

>>> arr.mean(axis=0)
array([12.66666667,  9.33333333,  9.66666667, 10.        , 12.66666667])
>>> arr.mean(axis=1)
array([ 8.4, 16. ,  8.2])
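
One thing to watch: if part of the array is still NaN, mean() returns NaN for every row or column that touches it. np.nanmean skips the NaN entries instead. A minimal sketch, reusing the first two rows from above and leaving the third row unfilled on purpose:

```python
import numpy as np

arr = np.full((3, 5), np.nan)
arr[0] = [7, 8, 5, 5, 17]
arr[1] = [16, 14, 15, 19, 16]
# The third row is left as NaN on purpose

# Every column contains a NaN, so every column mean is NaN
col_mean = arr.mean(axis=0)

# nanmean ignores the NaN entries and averages the remaining values
col_nanmean = np.nanmean(arr, axis=0)

print(col_mean)
print(col_nanmean)
```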

References

https://pandas.pydata.org

https://www.python.org/