I often need to classify data. A few weeks ago, I had a huge file with dates and precipitation, and one of the tasks was to classify each day by ten-day period. Each month was divided into three sections as follows:
- from day 01 up to 10 is *ten-day 01*
- from day 11 up to 20 is *ten-day 02*
- from day 21 up to 31 is *ten-day 03*
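The rule above can also be written arithmetically. This is only a minimal sketch, assuming the days are integers from 1 to 31; the helper name `ten_day` is hypothetical and not part of the original workflow:

```python
def ten_day(day: int) -> str:
    """Return the ten-day label ('01', '02' or '03') for a day of the month."""
    if not 1 <= day <= 31:
        raise ValueError(f"day must be between 1 and 31, got {day}")
    # Days 1-10 -> 1, 11-20 -> 2, 21-31 -> 3 (day 31 is capped into the third block).
    index = min((day - 1) // 10 + 1, 3)
    return f"{index:02d}"


labels = [ten_day(d) for d in (1, 10, 11, 20, 21, 31)]
print(labels)  # ['01', '01', '02', '02', '03', '03']
```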
Most of the time I use loops (in Python), but this time the process was a little slow. To speed it up, I tried the map() function.
According to the Real Python article Python’s map(): Processing Iterables Without a Loop, the map() function has two advantages:
Since map() is written in C and is highly optimized, its internal implied loop can be more efficient than a regular Python for loop. This is one advantage of using map().
A second advantage of using map() is related to memory consumption. With a for loop, you need to store the whole list in your system’s memory. With map(), you get items on demand, and only one item is in your system’s memory at a given time.
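The on-demand behaviour described in the second point can be seen in a small sketch. In Python 3, map() returns an iterator, so values are computed only when they are requested. The function and variable names here are illustrative only:

```python
calls = []

def double(x):
    # Record each call so we can see when the work actually happens.
    calls.append(x)
    return 2 * x

lazy = map(double, [1, 2, 3])    # nothing is computed yet
calls_before = list(calls)       # still empty

first = next(lazy)               # computes only the first item
calls_after_first = list(calls)  # one call recorded so far

rest = list(lazy)                # consumes the remaining items
```

Wrapping the call in list(), as done later in this post, materialises every result at once, so the memory advantage applies only while the iterator is consumed item by item.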
So, let’s try!
Hands On!
Load packages and create a data frame:
import pandas as pd
import numpy as np
import time
df_prob = pd.DataFrame({'day': np.random.uniform(1, 31, 50000).round(0)})
df_prob.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 day 50000 non-null float64
dtypes: float64(1)
memory usage: 390.8 KB
Now, let's compare the loop with map():
# loop
start_time = time.time()
ls_desc = []
for idx in range(len(df_prob)):
    if df_prob['day'].iloc[idx] <= 10:
        dd = '01'
    elif df_prob['day'].iloc[idx] > 10 and \
         df_prob['day'].iloc[idx] <= 20:
        dd = '02'
    elif df_prob['day'].iloc[idx] > 20 and \
         df_prob['day'].iloc[idx] <= 31:
        dd = '03'
    else:
        dd = '-99'
    ls_desc.append(dd)
print("loop --- %s seconds ---" % (time.time() - start_time))
# map
start_time = time.time()
ls_desc = list(map(lambda x: '01' if x <= 10 else
                   ('02' if x > 10 and x <= 20 else
                    ('03' if x > 20 and x <= 31 else '-99')),
                   df_prob['day'].values))
print("map --- %s seconds ---" % (time.time() - start_time))
The results:
>>> loop --- 0.8795549869537354 seconds ---
>>> map --- 0.022134780883789062 seconds ---
The map() function was faster than the loop: about 0.02 s versus 0.88 s, roughly 40 times faster. For a small data set like this one, maybe it does not matter. But for a large file...
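Part of the gap above probably comes from the repeated `.iloc` lookups in the loop, while map() walks directly over the NumPy array in `df_prob['day'].values`. As a rough way to see how much is due to map() itself, the sketch below times both approaches on a plain Python list using only the standard library. The function names, the random seed and the number of repetitions are assumptions for illustration, and timings will vary by machine:

```python
import random
import timeit

random.seed(0)
days = [random.randint(1, 31) for _ in range(50_000)]

def with_loop(values):
    """Classify each day with an explicit for loop."""
    out = []
    for x in values:
        if x <= 10:
            dd = '01'
        elif x <= 20:
            dd = '02'
        elif x <= 31:
            dd = '03'
        else:
            dd = '-99'
        out.append(dd)
    return out

def with_map(values):
    """Classify each day with map() and a nested conditional expression."""
    return list(map(lambda x: '01' if x <= 10 else
                    ('02' if x <= 20 else
                     ('03' if x <= 31 else '-99')),
                    values))

# Both approaches must produce the same labels.
same_result = with_loop(days) == with_map(days)

# Total time for 5 runs of each approach over the same list.
loop_time = timeit.timeit(lambda: with_loop(days), number=5)
map_time = timeit.timeit(lambda: with_map(days), number=5)
print(f"loop --- {loop_time:.4f} seconds ---")
print(f"map  --- {map_time:.4f} seconds ---")
```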
Note that, because a lambda is limited to a single expression, the conditions have to be written as nested if / else expressions inside the map() call.
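The nesting is a consequence of using a lambda. As an alternative sketch, map() also accepts a regular named function, which allows ordinary if statements. The name `classify_day` and the sample values are made up for illustration:

```python
def classify_day(day):
    """Return '01', '02' or '03' for a day of the month, or '-99' above 31."""
    if day <= 10:
        return '01'
    if day <= 20:
        return '02'
    if day <= 31:
        return '03'
    return '-99'


sample_days = [1, 10, 11, 20, 21, 31, 40]
sample_labels = list(map(classify_day, sample_days))
print(sample_labels)  # ['01', '01', '02', '02', '03', '03', '-99']
```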
Extra test
Afterwards, I thought about R, because I use it for several analyses. Actually, R is my favorite choice for data analysis.
Here is how base R performed:
df_prob = data.frame('day' = round(runif(50000, 1, 31)))
start_time = Sys.time()
# nested ifelse() is vectorized over the whole column
ls_desc <- ifelse(
  df_prob$day <= 10, '01',
  ifelse(df_prob$day > 10 & df_prob$day <= 20, '02',
    ifelse(df_prob$day > 20 & df_prob$day <= 31, '03', '-99')))
final_time <- (Sys.time() - start_time)
print(paste("ifelse ---", final_time, "seconds ---"))
> [1] "ifelse --- 0.0353324413299561 seconds ---"
R was nearly as fast as the map() function in Python (about 0.035 s versus 0.022 s, so a little slower). Besides, both solutions have a similar style.
Hardware and software
The test was performed on a workstation:
rafatieppo@rt-av52a:~/Dropbox/emacs_dot$ screenfetch
rafatieppo@rt-av52a
OS: Debian 10 buster
Kernel: x86_64 Linux 4.19.0-16-amd64
Uptime: 1h 13m
Packages: 2699
Shell: bash 5.0.3
Resolution: 3840x1080
DE: MATE 1.20.2
WM: Metacity (Marco)
GTK Theme: 'Adapta' [GTK2/3]
Icon Theme: Numix
Font: Monaco 14
CPU: Intel Core i5-9300H @ 8x 4.1GHz [44.0°C]
GPU: GeForce GTX 1050
RAM: 1862MiB / 15901MiB
References
https://realpython.com/python-map-function/