Since 2005, I have mainly used my laboratory email for research, and I wanted to have a look at the distribution of the mails I received and sent per day, month and hour. For more than 5 years, I have also been using mu to collect and index my mails. This way, my mails are also stored on my personal computer and not only on a remote IMAP server.
Getting the timestamps of every mail
First, we get every mail with the mu find command and dispatch them between the ones I sent and the ones I received. The LC_TIME=C variable just makes sure the output time is not in a locale-specific format.
!LC_TIME=C mu find -f "d,m" "" | awk -F, '/sent/{print $1>"/tmp/sent.csv"}{print $1>"/tmp/received.csv"}'
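In awk, a block without a pattern runs for every input line, so the second block of the one-liner above also writes the sent mails to /tmp/received.csv. If a strict split is wanted, one possible sketch in Python follows. The split_by_folder helper and the sample lines are hypothetical, and the sketch assumes that the maildir of sent mails contains the string "sent".

```python
def split_by_folder(lines):
    """Split 'date,maildir' lines into (sent, received) lists of dates."""
    sent, received = [], []
    for line in lines:
        # Like awk -F, with $1: keep what precedes the first comma as the date
        date, _, maildir = line.rstrip("\n").partition(",")
        # Assumption: sent mails live in a maildir whose name contains "sent"
        (sent if "sent" in maildir else received).append(date)
    return sent, received


# Hypothetical sample lines mimicking a "date,maildir" output
sample = [
    "Wed Apr  6 15:12:03 2005,/inbox",
    "Thu Apr  7 09:01:44 2005,/sent",
]
sent_dates, received_dates = split_by_folder(sample)
```

Each mail ends up in exactly one of the two lists, which is what the per-folder statistics below implicitly expect.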
Then, we prepare a pandas.DataFrame in which we count the number of mails per one-hour bin.
import pandas as pd


def get_mail_count(fn):
    # The file holds one timestamp per line and no header
    df = pd.read_csv(
        fn,
        header=None,
        names=["date"],
        parse_dates=["date"],
    )
    # Round each timestamp down to the hour, then count the mails per hour
    return df.date.dt.floor("h").value_counts().to_frame().sort_index()


sent = get_mail_count("/tmp/sent.csv")
received = get_mail_count("/tmp/received.csv")
received.head()
date | count |
---|---|
2005-04-06 15:00:00 | 2 |
2005-06-01 11:00:00 | 1 |
2005-07-04 17:00:00 | 1 |
2005-09-21 21:00:00 | 1 |
2005-09-27 14:00:00 | 1 |
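To make the binning step concrete, here is a small sketch on hypothetical timestamps (the values are made up for illustration). It shows that floor("h") rounds each timestamp down to the start of its hour, and that value_counts then only lists the hours in which at least one mail exists.

```python
import pandas as pd

# Hypothetical timestamps: two mails in the 15:00 hour, one in the 11:00 hour
stamps = pd.Series(
    pd.to_datetime(
        ["2005-04-06 15:12:03", "2005-04-06 15:48:10", "2005-06-01 11:05:00"]
    ),
    name="date",
)

# floor("h") rounds each timestamp down to the start of its hour;
# value_counts() then counts how many mails fall in each hour
per_hour = stamps.dt.floor("h").value_counts().sort_index()
```

Hours without any mail are simply absent from the result, which matters later when averaging.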
The data is now ready, and we can move on to the visualization.
Interactive chart with plotly
I wanted to plot the number of mails per day the way GitHub/GitLab show a user's project contributions: something fancy (maybe not really instructive) but also interactive. plotly has support for Heatmap, but we have to tweak its appearance to make it work. I found this gist, which does all the hard work. I changed it a bit, especially by using the calendar library, and I also added some options such as the color palette. Below is the full code.
import calendar
import datetime

import numpy as np
import plotly.graph_objs as go
import seaborn as sns
from plotly.subplots import make_subplots


def display_year(
    z,
    year: int,
    month_lines: bool = True,
    fig=None,
    row: int = None,
    title: str = None,
    palette: str = "rocket_r",
):
    month_names = [calendar.month_abbr[i] for i in range(1, 13)]
    month_days = [calendar.monthrange(year, i)[-1] for i in range(1, 13)]
    # Put each month label roughly in the middle of its weeks
    month_positions = (np.cumsum(month_days) - 15) / 7

    # Flat list of (month, day) pairs covering the whole year
    month_days = sum(
        [list(zip([i + 1] * m, range(1, m + 1))) for i, m in enumerate(month_days)], []
    )
    weekdays_in_year = [calendar.weekday(year, month, day) for month, day in month_days]
    dates_in_year = [datetime.date(year, month, day) for month, day in month_days]

    # ISO weeks can spill over the year boundary: force early January into
    # column 0 and late December into column 53
    weeknumber_of_dates = []
    for date in dates_in_year:
        inferred_week_no = date.isocalendar().week
        if inferred_week_no >= 52 and date.month == 1:
            weeknumber_of_dates.append(0)
        elif inferred_week_no == 1 and date.month == 12:
            weeknumber_of_dates.append(53)
        else:
            weeknumber_of_dates.append(inferred_week_no)

    data = [
        go.Heatmap(
            x=weeknumber_of_dates,
            y=weekdays_in_year,
            z=z,
            text=[str(date) for date in dates_in_year],
            hoverinfo="text+z",
            xgap=3,
            ygap=3,
            showscale=False,
            colorscale=["#eeeeee"] + sns.color_palette(palette).as_hex(),
        )
    ]

    if month_lines:
        kwargs = dict(
            mode="lines",
            line=dict(
                color="#9e9e9e",
                width=1,
            ),
            hoverinfo="skip",
        )
        # Draw a separator around the first day of each month
        for date, dow, wkn in zip(dates_in_year, weekdays_in_year, weeknumber_of_dates):
            if date.day == 1:
                data += [
                    go.Scatter(
                        x=[wkn - 0.5, wkn - 0.5],
                        y=[dow - 0.5, 6.5],
                        **kwargs,
                    )
                ]
                if dow:
                    data += [
                        go.Scatter(
                            x=[wkn - 0.5, wkn + 0.5],
                            y=[dow - 0.5, dow - 0.5],
                            **kwargs,
                        ),
                        go.Scatter(
                            x=[wkn + 0.5, wkn + 0.5],
                            y=[dow - 0.5, -0.5],
                            **kwargs,
                        ),
                    ]

    layout = go.Layout(
        title=title,
        height=250,
        yaxis=dict(
            showline=False,
            showgrid=False,
            zeroline=False,
            tickmode="array",
            ticktext=[calendar.day_abbr[i] for i in range(7)],
            tickvals=list(range(7)),
            autorange="reversed",
        ),
        xaxis=dict(
            showline=False,
            showgrid=False,
            zeroline=False,
            tickmode="array",
            ticktext=month_names,
            tickvals=month_positions,
        ),
        font={"size": 10, "color": "#9e9e9e"},
        plot_bgcolor=("#fff"),
        margin=dict(t=40),
        showlegend=False,
    )

    if fig is None:
        fig = go.Figure(data=data, layout=layout)
    else:
        # Add the traces to the subplot at the given (0-based) row
        fig.add_traces(data, rows=[(row + 1)] * len(data), cols=[1] * len(data))
        fig.update_layout(layout)
        fig.update_xaxes(layout["xaxis"])
        fig.update_yaxes(layout["yaxis"])
    return fig
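The week-number handling in display_year deserves a short illustration. ISO week numbers do not stop at year boundaries: the first days of January can belong to week 52 or 53 of the previous year, and the last days of December to week 1 of the next one. The sketch below isolates that rule in a hypothetical week_column helper (not part of the original code) and applies it to three dates.

```python
import datetime


def week_column(date):
    """Column (week number) of a date in a single-year calendar heatmap."""
    week = date.isocalendar()[1]  # ISO week number, from 1 to 53
    if week >= 52 and date.month == 1:
        # Early January counted in the previous ISO year: first column
        return 0
    if week == 1 and date.month == 12:
        # Late December counted in the next ISO year: last column
        return 53
    return week


first_col = week_column(datetime.date(2021, 1, 1))    # ISO week 53 of 2020
last_col = week_column(datetime.date(2019, 12, 30))   # ISO week 1 of 2020
mid_col = week_column(datetime.date(2021, 6, 15))     # regular ISO week
```

Without this remapping, the first days of January 2021 would be drawn in the rightmost columns of the 2021 panel instead of the leftmost one.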
We can now use it with our DataFrames to plot the number of mails received and sent per day. Here, we use the power of pandas to extract time information from the DatetimeIndex of each DataFrame.
def display_years(df, palette):
    years = df.index.year.unique().tolist()
    fig = make_subplots(
        rows=len(years), cols=1, subplot_titles=years, vertical_spacing=0.2 / len(years)
    )
    for i, year in enumerate(years):
        # One slot per day of the year (366 for leap years)
        data = np.zeros(365 + calendar.isleap(year))
        mask = df.index.year == year
        for index, row in df[mask].iterrows():
            # df holds hourly counts: accumulate them to get the daily total
            data[index.day_of_year - 1] += row["count"]
        display_year(data, year=year, fig=fig, row=i, palette=palette)
    fig.update_layout(height=250 * len(years))
    return fig
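The per-day totals can also be computed without an explicit loop. The following is only a sketch of that idea, assuming an hourly DataFrame with a "count" column and a DatetimeIndex like the ones built earlier; the daily_counts_for_year helper and the toy data are hypothetical.

```python
import pandas as pd


def daily_counts_for_year(hourly, year):
    """Return one total per calendar day of `year` from hourly counts."""
    days = pd.date_range(f"{year}-01-01", f"{year}-12-31", freq="D")
    # Sum the hourly counts of each day, then align on every day of the year
    daily = hourly["count"].resample("D").sum()
    return daily.reindex(days, fill_value=0).to_numpy()


# Hypothetical hourly counts: two busy hours on January 1st, one on March 1st
toy = pd.DataFrame(
    {"count": [2, 3, 1]},
    index=pd.to_datetime(
        ["2020-01-01 09:00", "2020-01-01 17:00", "2020-03-01 10:00"]
    ),
)
totals = daily_counts_for_year(toy, 2020)
```

The resulting array has one entry per day of the year (366 here, since 2020 is a leap year) and could be passed as the z argument of display_year.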
Let's plot the number of mails received per day, for each year.
display_years(received, palette="Greens")
Quite interestingly, I received significantly fewer mails in 2013 and 2017; I don't know why. During my PhD years (2005-2008), mails are also scarcer whereas, once I started having teaching responsibilities, the distribution of mails looks much flatter and more homogeneous. Let's have a look at the mails I sent.
display_years(sent, palette="Blues")
The first obvious remark is that I have lost all the mails I sent between 2005 and 2011. I guess I lost them when I did some cleaning, which might also be the reason for the missing mails in the second part of 2012 and in 2013. It is also clear that I send far fewer mails during the summer holidays, especially in August.
Linear visualization of the variation of the number of mails
Let's look at the distribution of mails in a more usual way. We first resample the data by summing the number of mails per day, then smooth the result with a Gaussian rolling window (a kernel smoothing in the spirit of a Kernel Density Estimation, aka KDE), and finally plot the number of mails versus time.
pd.options.plotting.backend = "plotly"


def get_kde_resample(sample="D"):
    # Look up the "received" and "sent" DataFrames by name, resample and smooth them
    return pd.concat(
        [
            globals()[flag]
            .resample(sample)
            .sum()
            .rolling(50, center=True, win_type="gaussian")
            .mean(std=10)
            .assign(flag=flag)
            for flag in ["received", "sent"]
        ]
    )


df = get_kde_resample()
# kwargs and layout_kwargs are kept for the later plots
fig = df.plot(
    **(
        kwargs := dict(
            y="count",
            color="flag",
            labels={"count": "number of mails", "flag": ""},
            template="seaborn",
        )
    )
).update_layout(
    **(
        layout_kwargs := dict(
            xaxis_title=None,
            legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
        )
    )
)
# Shade July and August of every year
for year in df.index.year.unique():
    fig.add_vrect(x0=f"{year}-07-01", x1=f"{year}-08-31", fillcolor="grey", opacity=0.25)
fig
Here, the gray areas correspond to the summer periods (July and August), just to highlight the decrease in mail activity. There is also a large increase in mails received just before the 2019 and 2020 summer periods which, I guess, corresponds to automatic mails sent when students register at the university (I do not receive these mails anymore).
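To make the smoothing step more concrete, here is a minimal NumPy sketch of a Gaussian-weighted moving average using the same window length (50) and standard deviation (10). It is meant to illustrate the idea rather than to reproduce the pandas output exactly: the gaussian_smooth helper is hypothetical and, unlike the centered rolling window, it simply drops the edges where the window does not fit.

```python
import numpy as np


def gaussian_smooth(values, window=50, std=10.0):
    """Gaussian-weighted moving average over full windows only."""
    # Offsets of each window position from the window center
    offsets = np.arange(window) - (window - 1) / 2
    weights = np.exp(-0.5 * (offsets / std) ** 2)
    # Normalize so that the weights sum to 1 (weighted mean)
    weights /= weights.sum()
    return np.convolve(values, weights, mode="valid")


# A flat series stays flat, a single spike is spread over neighboring days
flat = gaussian_smooth(np.ones(200))
spike = np.zeros(200)
spike[100] = 1.0
spread = gaussian_smooth(spike)
```

The spike keeps its total weight but is spread over about fifty days, which is why isolated bursts of mails appear as smooth bumps in the chart.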
Finally, let's have a look at the distribution of mails per year, per hour and per weekday. We start by merging the received/sent DataFrames and adding a flag to tell them apart.
df = pd.concat([globals()[flag].assign(flag=flag) for flag in ["received", "sent"]])
Below is the total number of mails received/sent per year...
by_year = (
    df.groupby(by=["flag", df.index.year])
    .sum()
    .unstack()
    .fillna(0)
    .astype(int)
    .transpose()
    .droplevel(0)
)
by_year.style.background_gradient(axis=0, cmap=sns.diverging_palette(220, 20, as_cmap=True))
date | received | sent |
---|---|---|
2005 | 31 | 0 |
2006 | 252 | 0 |
2007 | 490 | 0 |
2008 | 1019 | 0 |
2009 | 3533 | 0 |
2010 | 4196 | 0 |
2011 | 6289 | 1198 |
2012 | 7559 | 799 |
2013 | 7601 | 0 |
2014 | 12446 | 2028 |
2015 | 13716 | 2315 |
2016 | 8586 | 1973 |
2017 | 10565 | 2533 |
2018 | 12196 | 2869 |
2019 | 12102 | 2798 |
2020 | 12514 | 2349 |
2021 | 7884 | 944 |
2022 | 7698 | 984 |
2023 | 5046 | 654 |
... and the same information as a chart.
(
    pd.melt(by_year, value_name="count", ignore_index=False)
    .plot(**kwargs)
    .update_layout(**layout_kwargs)
)
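To see what the groupby/unstack chain used for the per-year table does, here is the same sequence of calls on a hypothetical four-row DataFrame (the toy values are made up for illustration).

```python
import pandas as pd

# Hypothetical merged data: three hourly counts received, one sent
idx = pd.to_datetime(
    ["2020-01-01 09:00", "2020-06-01 10:00", "2021-02-01 11:00", "2020-03-01 08:00"]
)
toy = pd.DataFrame(
    {"count": [2, 3, 1, 4], "flag": ["received", "received", "received", "sent"]},
    index=idx,
)
toy.index.name = "date"

by_year = (
    toy.groupby(by=["flag", toy.index.year])  # one group per (flag, year)
    .sum()                                    # total mails in each group
    .unstack()                                # years become columns
    .fillna(0)                                # a missing (flag, year) pair means no mail
    .astype(int)
    .transpose()                              # years as rows, flags as columns
    .droplevel(0)                             # drop the leftover "count" level
)
```

The result has one row per year and one column per flag, with 0 where a flag has no mail for a given year (here, no mail sent in 2021).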
Let's have a look at the mean number of mails per hour of the day and per weekday.
(
    df.groupby(by=["flag", df.index.time])
    .mean()
    .reset_index(level=0)
    .plot(**kwargs)
    .update_layout(**layout_kwargs)
)
(
    df.groupby(by=["flag", df.index.dayofweek])
    .mean()
    .reset_index(level=0)
    .plot.bar(**kwargs)
    .update_layout(
        xaxis=dict(
            title=None, tickvals=list(range(7)), ticktext=[calendar.day_name[i] for i in range(7)]
        ),
        **layout_kwargs
    )
)
Nothing really striking: weekends are quieter, just like nights and lunchtime.
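One caveat on these means: the hourly DataFrames come from value_counts, so they only contain the hours with at least one mail, and the averages are taken over those hours only. As a sketch of the difference on hypothetical toy data, filling the empty hours with zeros first lowers the mean.

```python
import pandas as pd

# Hypothetical hourly counts: mails at 09:00 on two days out of three
hourly = pd.DataFrame(
    {"count": [4, 2]},
    index=pd.to_datetime(["2023-01-02 09:00", "2023-01-04 09:00"]),
)

# Mean over the hours that contain at least one mail
mean_nonempty = hourly.groupby(hourly.index.hour)["count"].mean()

# Insert the missing hours as zeros, then average per hour of the day
full = hourly["count"].resample("h").sum()
mean_all = full.groupby(full.index.hour).mean()
```

Which of the two is more meaningful depends on the question: the first describes how busy an hour is when mails do arrive, the second the expected number of mails at that hour on any day.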
Of course, this can be applied to many other kinds of activity, and I will see whether I reuse this notebook in the future.