A fundamental challenge in calcium imaging has been to infer the spike rates of neurons from noisy calcium fluorescence traces. We collected a large benchmark dataset (more than 100,000 spikes, 73 neurons) recorded from different neural tissues (V1 and retina) using different calcium indicators (OGB-1 and GCaMP6s). We introduce a new algorithm based on supervised learning in flexible probabilistic models and systematically compare it against a range of previously published spike inference algorithms. We show that our new supervised algorithm outperforms all previously published techniques. Importantly, it outperforms the other algorithms even when applied to entirely new datasets for which no simultaneously recorded spike data are available. Future data acquired under new experimental conditions can easily be used to further improve its spike prediction accuracy and generalization performance. Finally, we show that comparing algorithms on artificial data is not informative about their performance on real data. This suggests that benchmark datasets such as the one we provide may greatly facilitate future algorithmic developments.