Statistical practice



Any experiment that involves a source of randomness requires sound statistical practice. Without it, your conclusions cannot be trusted.


Intended for: BSc, MSc, PhD

Experiments & Noise


A key insight in science is that most experiments will not give exactly the same outcome when you repeat them (which is one of the factors behind the replication crisis in science). There are various reasons why the outcome of an experiment may change over repetitions:

The key message of this page is that you always need to think about these possible sources of noise in your experiment, and how you will handle them, to make your interpretations valid and useful.


Finite sample size noise

Finite sample size noise is a key issue in nearly all empirical research. How we deal with the problem depends on our underlying question, which brings us to the major distinction between statistics and machine learning: 


Statistics


Machine learning / supervised learning


Reinforcement learning

In reinforcement learning we usually do not have a finite dataset, but access to an environment/simulator from which we can, in theory, sample an unlimited amount of data. As a result, the train/test split usually disappears: at test time, we simply sample new data from the environment/simulator.
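As a minimal sketch of this idea, the snippet below evaluates a fixed policy by sampling fresh episodes instead of using a held-out test set. The simulator `sample_episode_return` and the `policy_bias` parameter are hypothetical stand-ins invented for illustration; a real environment would replace them.

```python
import random
import statistics


def sample_episode_return(policy_bias, rng):
    """Hypothetical simulator: noisy return of one episode under a fixed policy."""
    return policy_bias + rng.gauss(0.0, 1.0)


def evaluate(policy_bias, n_episodes, seed):
    """Estimate the expected return by sampling new episodes at test time."""
    rng = random.Random(seed)  # evaluation uses its own random generator
    returns = [sample_episode_return(policy_bias, rng) for _ in range(n_episodes)]
    return statistics.mean(returns)


estimate = evaluate(policy_bias=2.0, n_episodes=1000, seed=0)
print(f"estimated return: {estimate:.3f}")
```

Because the number of evaluation episodes is still finite in practice, the estimate remains noisy; sampling more episodes reduces that noise.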


Algorithm (repetition) noise


Statistics


Supervised learning


Reinforcement learning


Running repetitions

Some advice for running experiment repetitions (for reinforcement learning and supervised learning):

1. Run enough repetitions: Run as many as is computationally feasible, and aim for at least 3 to 5.

2. Make repetitions truly independent: Make sure that every repetition is a completely new run. Reinitialize your network parameters, reinitialize your environment, etc. Make sure that the comparison is fair: each run should be a clear new problem instance that another researcher (who wants to replicate or use your work) would also face.

3. Never tune the seed: You perform repetitions to get an estimate of how your method does on average. Therefore, your repetitions need to be randomly drawn from the space of possible experiments. Selecting the seeds that happen to give good results biases this estimate.
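The advice above can be sketched as follows. This is a minimal illustration, not a prescribed setup: `run_experiment` is a hypothetical stand-in for one complete training run, and the choice to draw the per-run seeds from a single master generator is an assumption made for this example. The point is that the seeds are fixed before any result is seen, every run starts from fresh state, and all runs are reported.

```python
import random
import statistics


def run_experiment(seed):
    """Hypothetical stand-in for one full run; returns a final performance score."""
    rng = random.Random(seed)  # fresh generator: no state shared between runs
    # Reinitialize the "parameters" from scratch in every repetition.
    weights = [rng.gauss(0.0, 1.0) for _ in range(10)]
    noise = rng.gauss(0.0, 0.1)
    return 1.0 / (1.0 + sum(w * w for w in weights) / len(weights)) + noise


def run_all(n_repetitions=5, master_seed=0):
    """Draw the seeds up front, then run every repetition and keep every result."""
    master = random.Random(master_seed)
    seeds = [master.randrange(2**32) for _ in range(n_repetitions)]
    return [run_experiment(seed) for seed in seeds]


results = run_all()
print(f"mean over {len(results)} runs: {statistics.mean(results):.3f}")
```

Recording the master seed keeps the whole set of repetitions reproducible without ever selecting individual seeds based on their outcome.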



Reporting results

Finally, you need to decide how you report on the noise in your experiments: 


1. Statistical measure of interest: This is usually the mean, but other statistics such as the median or the max may also make sense, as long as you can argue why.


2. Final performance versus learning curves: Determine whether you only want to show the final performance, or whether you want to show the entire learning curve (with some measure of execution time on the x-axis, and performance on the y-axis). 


3. Standard deviation versus standard error: You then usually want to report on the amount of noise, either in your parameter estimates (for finite sample size noise) or over your repetitions (for repetition noise). You have two choices, which are both relevant:


4. Statistical testing: When you want to formally assess whether one method is better than another, you should use a statistical hypothesis test.
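The reporting steps above can be illustrated with a short sketch. It computes the mean, the sample standard deviation (the spread of individual runs), and the standard error (the uncertainty of the estimated mean) over repetitions, and then compares two methods with a permutation test. The permutation test is only one possible hypothesis test, chosen here because it needs nothing beyond the standard library; the scores are made-up numbers, and the function names are hypothetical.

```python
import math
import random
import statistics


def summarize(scores):
    """Return mean, sample standard deviation, and standard error of the mean."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # divides by n - 1
    sem = std / math.sqrt(len(scores))  # shrinks as the number of runs grows
    return mean, std, sem


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign scores to the two groups
        left, right = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(left) / len(left) - sum(right) / len(right))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (n_permutations + 1)


# Made-up final scores of two methods over five repetitions each.
method_a = [0.81, 0.79, 0.84, 0.80, 0.83]
method_b = [0.74, 0.78, 0.75, 0.77, 0.73]

mean_a, std_a, sem_a = summarize(method_a)
mean_b, std_b, sem_b = summarize(method_b)
p_value = permutation_test(method_a, method_b)

print(f"A: {mean_a:.3f} (std {std_a:.3f}, standard error {sem_a:.3f})")
print(f"B: {mean_b:.3f} (std {std_b:.3f}, standard error {sem_b:.3f})")
print(f"permutation test p-value: {p_value:.4f}")
```

With only a handful of repetitions, any such test has limited power, so a non-significant result does not show that two methods perform equally.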