statistics

Bayesian analysis of blog usage

Posted on Updated on

The TopoToolbox blog has now been online since October 2014. Since then, the blog had ~22k visitors. With a total of 8.7k, 2017 was the year with the most visitors. Yet, monthly statistics show quite some variability throughout the years. The blog tends to have fewer visitors in December and during the summer months July and August than during the remaining months. In addition, there are strong fluctuations between usage statistics of months that have seen many posts and those when I had no time to post.  This variability superimposes the general trend. We may thus ask the question, whether we truely experience an increase in visitor counts.

To answer this question, I took monthly visitor counts during the last two years (unfortunately, I can only access the last two years). I saved them in an Excel file which I then imported to MATLAB. My aim was then to do a regression analysis to be able to identify possible trends. Since the dependent data are counts, a standard least squares regression is clearly not advisable. Rather, count data are modelled using Poisson statistics. I thus chose a generalized linear regression model with counts as a Poisson response, and time as independent variable. I used ‘log’ as a link function so that the response variable is Y~Poisson(exp(b1 + b2(X))).

The function fitglm (which is part of the Statistics and Machine Learning Toolbox) is perfectly suited to solve this regression problem. However, the function takes a frequentist approach which I didn’t want to follow here. Rather, I wanted to learn how to tackle this problem with Bayesian inference. Gladly, MATLAB increasingly features tools for Bayesian analysis such as the Hamiltonian Monte Carlo (HMC) sampler. The only thing I needed was some guidance of how to derive the prior distributions and how to calculate the posterior probability density function. For this, I found this paper by Doss and Narasimhan quite useful.

Writing tools for Bayesian analysis is not easy and the code quite difficult to debug. But finally, I managed to get my analysis working, and to return following figure:

blogstats.png
Bayesian analysis of monthly blog statistics.

Clearly, the posterior distribution of beta_1, the slope of the regression, does not include zero so I can assume with high certainty, that we truly observe a trend to higher visitor numbers. Can this finding be extrapolated? Well, I don’t know. Just keep on visiting this blog frequently, and we’ll see.

If you are interested in the technical details, here is the function:

function bayesianpoissonregress(t,y)

%Bayesianpoissonregression
%
% t --> datetime vector (equally spaced)
% y --> vector with counts

t = t(:);
y = y(:);

% Numeric month indices
X = (1:size(t,1))';
% add offset to design matrix
X = [ones(size(X,1),1) X];

% link function
gamma = @(beta) exp(beta(:)'*X')';

% Prior distributions of the parameters are gaussian
a = [0 0]; % prior means
b = [20 20]; % prior std

d = a./b + sum(X.*y);

%% MCMC sampling
hmc = hmcSampler(@(beta) logpdf(beta),[0 0],'UseNumericalGradient',true);
% Estimate max. a posteriori
map = estimateMAP(hmc);
% tune sampler
hmc = tuneSampler(hmc,'StartPoint',map);
% draw chain
chain = drawSamples(hmc,'StartPoint',map);


%% Plot results
figure
subplot(1,2,1)
[~,ax] = plotmatrix(chain);
ylabel(ax(1,1),'\beta_0');
ylabel(ax(2,1),'\beta_1');
xlabel(ax(2,1),'\beta_0');
xlabel(ax(2,2),'\beta_1');

subplot(1,2,2)
% random set of parameters from chain
sam = randperm(size(chain,1),100);
sam = chain(sam,:);

hold on
for r = 1:size(sam,1)
    chat = gamma(sam(r,:)); 
    plot(t,chat,'color',[.7 .7 .7]); 
end
plot(t,y,'--s')
ylabel('Visitors')
box on
% -- end of function

%% Log pdf
function p = logpdf(beta)
% posterior distribution according to 
% http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39.9684&rep=rep1&type=pdf
% (Eq. 2.2)

% posterior
p = - sum(1./(b.^2) .* beta.^2) + sum(d .* beta) - sum(gamma(beta));

end
end
Advertisements