Activity on the pandas GitHub repo during the March 10 documentation sprint

Last weekend, Marc Garcia and many others organised a world-wide pandas documentation sprint (https://python-sprints.github.io/pandas/). The goal was to improve the pandas API documentation, and I have to say, it was a great success!

I thought it would be nice to make a figure of the activity on GitHub during the sprint. Using https://www.githubarchive.org/ and the BigQuery interface to their data, it was quite easy. The following query counts the hourly number of events on the pandas-dev/pandas repo from March 1 to March 13, 2018:

SELECT 
  STRFTIME_UTC_USEC(created_at, "%Y-%m-%d %H") AS timestamp,
  COUNT(*) AS count
FROM (
  TABLE_DATE_RANGE([githubarchive:day.], 
    TIMESTAMP('2018-03-01'), 
    TIMESTAMP('2018-03-13')
  )) 
WHERE repo.name = 'pandas-dev/pandas'
GROUP BY
  timestamp
ORDER BY
  timestamp ASC

The above query looks for all types of events on GitHub, so the count is a total of issues and PRs opened or closed, comments, pushes, and so on (https://developer.github.com/v3/activity/events/types/).
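
For readers more at home in pandas than in SQL, here is a rough sketch of the same aggregation on a tiny made-up table. The `raw` frame, its column names and its rows are invented for illustration; the real GitHub Archive tables are far larger and have a different layout.

```python
import pandas as pd

# Hypothetical raw events: one row per GitHub event (made-up data).
raw = pd.DataFrame({
    "created_at": pd.to_datetime([
        "2018-03-10 14:05:00",
        "2018-03-10 14:40:00",
        "2018-03-10 15:10:00",
        "2018-03-10 15:20:00",
    ]),
    "repo_name": [
        "pandas-dev/pandas",
        "pandas-dev/pandas",
        "pandas-dev/pandas",
        "some-other/repo",
    ],
})

# WHERE repo.name = 'pandas-dev/pandas'
pandas_events = raw[raw["repo_name"] == "pandas-dev/pandas"]

# STRFTIME_UTC_USEC(created_at, "%Y-%m-%d %H") AS timestamp
timestamp = pandas_events["created_at"].dt.strftime("%Y-%m-%d %H")

# GROUP BY timestamp with COUNT(*); groupby sorts the keys in ascending order
counts = pandas_events.groupby(timestamp).size()
print(counts)
```

Every event type contributes one row, so the count per hour is simply the number of rows falling in that hour.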

I downloaded the result of the above query as a CSV file (note: there are also packages available to load the result of the query directly into a pandas DataFrame). We can now use pandas and matplotlib to make a graph of it.

In [1]:
%matplotlib inline

import pandas as pd
import matplotlib.pyplot as plt
In [2]:
events = pd.read_csv("results-20180313-132419.csv", index_col=0, parse_dates=True)
In [3]:
events.head()
Out[3]:
                     count
timestamp
2018-03-01 00:00:00      6
2018-03-01 01:00:00     24
2018-03-01 02:00:00     13
2018-03-01 03:00:00      4
2018-03-01 04:00:00      1

Some of the hours are missing because there were no recorded events, so to make sure we have a regular time series, I am using resample to have an hourly frequency while filling the missing hours with 0:

In [4]:
events = events.resample('H').asfreq().fillna(0)['count']
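
To see what this step does, here is a small sketch on made-up data. The `toy` frame and its values are invented for illustration and are not taken from the query result. I pass `pd.offsets.Hour()` instead of the `'H'` string because, as far as I know, recent pandas releases prefer the lowercase `'h'` alias, and the offset object sidesteps that difference.

```python
import pandas as pd

# Hypothetical toy data: hourly counts where 02:00 and 03:00 are missing.
idx = pd.to_datetime([
    "2018-03-01 00:00",
    "2018-03-01 01:00",
    "2018-03-01 04:00",
])
toy = pd.DataFrame({"count": [6, 24, 1]}, index=idx)

# asfreq() inserts the missing hours as NaN rows, fillna(0) turns them into 0.
hourly = toy.resample(pd.offsets.Hour()).asfreq().fillna(0)["count"]
print(hourly)
```

The result has one row per hour between the first and last timestamp. Note that the introduced NaN values turn the column into floats, which is why the zeros show up as `0.0`.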

Now we can make a plot of this:

In [5]:
fig, ax = plt.subplots(dpi=120)
events.plot(ax=ax)
ax.set(xlabel='', ylabel="Number of hourly events", title="GitHub activity in the pandas repo")
ax.annotate("What happened here?", (pd.Timestamp("2018-03-10"), 150), (pd.Timestamp("2018-03-02"), 200),
            arrowprops=dict(shrink=0.05, width=1, color='k'), fontsize=14)
fig.tight_layout()

So as expected, we clearly see a huge peak in GitHub activity compared to the weeks before :-)
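
If you prefer numbers over eyeballing the plot, the peak can also be located programmatically. The sketch below uses a made-up series (`toy_events`, with invented values) standing in for `events`, so the printed numbers say nothing about the real sprint.

```python
import pandas as pd

# Hypothetical hourly counts: a flat baseline of 5 events per hour over three days.
idx = pd.date_range("2018-03-08", periods=72, freq=pd.offsets.Hour())
toy_events = pd.Series(5.0, index=idx)

# Put an artificial spike on the afternoon of March 10.
toy_events.loc[pd.Timestamp("2018-03-10 14:00")] = 150.0

# Total events per day, and the day and hour with the most activity.
daily = toy_events.resample("D").sum()
busiest_day = daily.idxmax()
busiest_hour = toy_events.idxmax()

print(daily)
print("Busiest day:", busiest_day.date())
print("Busiest hour:", busiest_hour)
```

Running the same three lines on the real `events` series would give the daily totals for the sprint period.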

Many thanks to all organizers and contributors of the sprint. A lot of people learned about contributing to open source, and it made a significant impact on the quality of the pandas API documentation!

This post was written in the Jupyter notebook. You can download this notebook.
