FST/TST Live Scoring Pad

Score forced swim test or tail suspension test behavior in real time with keyboard shortcuts. Get immobility %, latency, bout analysis, time-bin breakdown, group comparisons, and inter-rater reliability.

Live Scoring · Inter-Rater QC · CSV Export

The interactive scoring pad lets you configure the assay and animal details, then score trials with keyboard shortcuts (I for Immobility, S for Swimming, C for Climbing) against a running trial timer (0:00.0 / 6:00.0 for a standard 6-minute session). Load the example FST/TST scoring data to see the full workflow.

Use for
  • Score Forced Swim Test trials in real time using keyboard shortcuts for immobility, swimming, and climbing behaviors
  • Score Tail Suspension Test trials in real time using keyboard shortcuts for immobility and active struggling
  • Validate automated video scoring systems (ANY-maze, EthoVision, ConductVision) by comparing against manual scores
  • Assess inter-rater reliability between two scorers using Cohen's kappa on time-bin data
  • Analyze pre-scored behavioral event data by importing timestamped logs for time-bin and group-level statistics
  • Compare treatment groups by exporting scored data for downstream statistical analysis (t-tests, ANOVA)
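
As an example of the last workflow, here is a minimal downstream comparison between two exported groups, assuming per-animal immobility percentages have already been computed. The group names and values are made up for illustration, not real results.

```python
# Hypothetical downstream group comparison on exported immobility percentages.
# Group labels and values are illustration data only.
from scipy import stats

vehicle    = [62.1, 58.4, 70.3, 65.0, 61.7, 68.2]  # immobility % per animal
fluoxetine = [41.5, 47.9, 38.2, 52.1, 44.8, 40.0]

t, p = stats.ttest_ind(vehicle, fluoxetine)  # independent-samples t-test
print(f"t = {t:.2f}, p = {p:.4f}")
```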

Don't use for

  • Automated video-based behavior classification — use ConductVision, which classifies immobility, swimming, and climbing automatically from video recordings without manual scoring
  • Pre-scored immobility totals that only need group statistics — use the FST Immobility Calculator or TST Immobility Calculator for batch analysis of already-quantified data
  • Non-despair behavioral assays (open field, elevated plus maze, Morris water maze) — this tool's behavioral categories and scoring logic are specific to the FST and TST

Manual Behavioral Scoring: Principles and Practice

Manual behavioral scoring remains the gold standard for validation in preclinical behavioral neuroscience, despite the availability of automated tracking systems. The fundamental principle is that a trained human observer classifies an animal's behavior into mutually exclusive categories at each moment in time. For the Forced Swim Test, the standard categories are immobility (floating with only minor movements to maintain head position), swimming (horizontal movement through the water, typically associated with serotonergic mechanisms), and climbing (vigorous upward-directed movements against the cylinder wall, associated with noradrenergic mechanisms). For the Tail Suspension Test, the primary distinction is between immobility (passive hanging without limb or body movement) and active struggling (curling, swinging, or limb movement).

The reliability of manual scoring depends on three factors: clear operational definitions that minimize ambiguity at behavioral boundaries, consistent training of all scorers on reference videos with known scores, and blinding of the scorer to treatment group assignment to prevent expectation bias.

Scoring can be performed live during the test session or from video recordings. Live scoring provides data immediately and avoids the storage and review time that video requires, but the scorer cannot pause, rewind, or review ambiguous moments. Video-based scoring allows frame-by-frame analysis but introduces the risk of altered perception at different playback speeds.

For both approaches, the scorer should maintain a consistent temporal sampling strategy: either continuous real-time coding (as this tool provides) or instantaneous sampling at fixed intervals (e.g., scoring behavior every 5 seconds). Continuous coding captures the exact duration of each behavior and is more sensitive to brief behavioral changes, while instantaneous sampling is simpler but can miss short-duration events. Whichever method is chosen, it must be reported in the methods section and applied consistently across all animals in the study.
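
As a concrete illustration of the two sampling strategies, the sketch below derives per-behavior durations and immobility % from a timestamped keypress log (continuous coding) and queries the same log at fixed intervals (instantaneous sampling). The event format, behavior names, and helper functions are assumptions for illustration, not this tool's actual data schema.

```python
# A minimal sketch of continuous real-time coding, assuming each keypress
# logs an (onset, behavior) event and that behavior persists until the next
# keypress or the end of the trial. Names are illustrative only.
import bisect

TRIAL_END = 360.0  # 6-minute trial, in seconds

# (onset in seconds, behavior), as logged by pressing I / S / C while scoring
events = [(0.0, "swimming"), (42.3, "immobility"), (95.8, "climbing"),
          (110.2, "immobility"), (240.0, "swimming"), (301.5, "immobility")]

def durations(events, trial_end):
    """Total seconds per behavior from a time-sorted onset log."""
    totals = {}
    for (onset, behavior), (next_onset, _) in zip(events,
                                                  events[1:] + [(trial_end, None)]):
        totals[behavior] = totals.get(behavior, 0.0) + (next_onset - onset)
    return totals

def sample_at(events, t):
    """Instantaneous sampling: the behavior in effect at time t."""
    onsets = [onset for onset, _ in events]
    return events[bisect.bisect_right(onsets, t) - 1][1]

totals = durations(events, TRIAL_END)
print(f"immobility: {100 * totals['immobility'] / TRIAL_END:.1f}% of trial")
print([sample_at(events, t) for t in range(0, 360, 60)])  # sample once per minute
```

Note how the sampled sequence discards the exact onset and offset times that the continuous log preserves, which is the trade-off described above.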

FST vs TST: When to Use Each Assay

The Forced Swim Test and Tail Suspension Test are the two most widely used acute assays of despair-like behavior in rodents, and choosing between them depends on species, research question, and practical constraints.

The FST was introduced by Porsolt, Le Pichon, and Jalfre in 1977 using rats and was later adapted for mice. The animal is placed in a cylinder of water (25 ± 1 °C) from which it cannot escape, and immobility is measured during a test session (typically 5-6 minutes), often preceded in rats by a 15-minute pretest swim 24 hours earlier (the pretest is sometimes omitted in mice). The FST works in both rats and mice and can differentiate the active behaviors of swimming (serotonergic) and climbing (noradrenergic/catecholaminergic), as demonstrated by Detke, Rickels, and Lucki (1995). However, water immersion introduces the confounds of hypothermia, wet fur affecting subsequent testing, and the physical stress of swimming.

The TST was developed by Steru, Chermat, Thierry, and Simon in 1985 as an alternative that avoids these water-related confounds. Mice are suspended by the tail with adhesive tape, and immobility is measured over a 6-minute trial with no pretest required. The TST is faster to set up, avoids hypothermia, and produces highly reproducible results in mice. However, it is generally limited to mice (rats are too heavy for reliable suspension), it cannot distinguish swimming from climbing (with the animal in air, active behavior is classified simply as struggling versus immobility), and some mouse strains (particularly C57BL/6J) exhibit tail-climbing behavior that complicates scoring.

Both assays detect the effects of acute antidepressant treatment (SSRIs, SNRIs, tricyclics) with high sensitivity, but they do not always agree: a compound may reduce immobility in the FST but not the TST, or vice versa, because the two assays engage partially different neurobiological substrates. Best practice in antidepressant screening is to use both assays when feasible and to report which behaviors (immobility, swimming, climbing, struggling) were scored, how immobility was operationally defined, and the temporal resolution of scoring.
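
To summarize the practical differences, here is a small, hypothetical pair of assay presets encoding the parameters discussed above. The class and field names are illustrative assumptions, not this tool's configuration format.

```python
# Illustrative assay presets; schema is an assumption for the sake of example.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class AssayConfig:
    name: str
    species: Tuple[str, ...]
    trial_minutes: float
    behaviors: Tuple[str, ...]
    pretest: Optional[str] = None  # None if no pretest session is required

FST = AssayConfig(
    name="Forced Swim Test",
    species=("rat", "mouse"),
    trial_minutes=6.0,  # test sessions are typically 5-6 min
    behaviors=("immobility", "swimming", "climbing"),
    pretest="15 min swim 24 h earlier (rats; sometimes omitted in mice)",
)

TST = AssayConfig(
    name="Tail Suspension Test",
    species=("mouse",),  # rats are generally too heavy for reliable suspension
    trial_minutes=6.0,
    behaviors=("immobility", "struggling"),  # in air, no swim/climb distinction
)
```

Keeping these parameters in one explicit structure also makes them easy to report verbatim in a methods section.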

Inter-Rater Reliability and Cohen's Kappa

Inter-rater reliability (IRR) quantifies agreement between two or more independent observers scoring the same behavioral event. In FST and TST scoring, IRR is critical because the boundary between immobility and low-level activity involves subjective judgment that can vary between scorers, between laboratories, and even within the same scorer over time (intra-rater drift).

The most widely used metric for two-rater categorical agreement is Cohen's kappa (1960), which adjusts for the proportion of agreement expected by chance alone: kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed proportion of agreement (the fraction of time-bins where both scorers assigned the same behavioral category) and P_e is the expected proportion of agreement under the null hypothesis that the two scorers assign categories independently according to their individual marginal distributions. Kappa ranges from -1 (perfect disagreement) through 0 (chance-level agreement) to +1 (perfect agreement). The standard interpretation scale (Landis and Koch, 1977) is:

  • below 0.00: poor
  • 0.00-0.20: slight
  • 0.21-0.40: fair
  • 0.41-0.60: moderate
  • 0.61-0.80: substantial
  • 0.81-1.00: almost perfect

For behavioral scoring in published FST and TST studies, a kappa of 0.80 or above is generally expected for the immobility category, and many journals and reviewers will question results scored with lower reliability. To maximize IRR, laboratories should:

  • develop written operational definitions with illustrative video examples for each behavioral category;
  • train all scorers on a common set of reference videos until they independently achieve kappa above 0.80;
  • have a subset of experimental trials (typically 10-20%) scored independently by two raters, and report the kappa value in the methods section;
  • monitor for intra-rater drift by re-scoring a subset of trials at the end of the study and comparing against the initial scores.

This tool computes Cohen's kappa automatically when two scorers' data are loaded for the same trial, using time-bin-level agreement. The bin width affects the kappa value: narrower bins (e.g., 1 second) are more sensitive to temporal alignment differences between scorers, while wider bins (e.g., 5 seconds) may obscure genuine disagreements. A bin width of 1-5 seconds is typical for FST/TST inter-rater analyses.
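
Once both trials are discretized into a common set of time bins, the kappa computation itself is short. Below is a minimal Python sketch assuming each bin carries exactly one behavioral label per scorer; the function and label names are illustrative, not this tool's internals.

```python
# Cohen's kappa for two scorers' time-bin labels — a minimal sketch assuming
# each trial has been discretized into equal-width bins (e.g., 1 s) with one
# categorical label per bin. Labels below are illustrative.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same, non-empty set of bins")
    n = len(rater_a)
    # Observed agreement: fraction of bins with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in freq_a)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1 - p_e)

# Example: two scorers, 10 one-second bins of an FST trial.
a = ["imm", "imm", "swim", "swim", "imm", "climb", "imm", "imm",  "swim", "imm"]
b = ["imm", "imm", "swim", "imm",  "imm", "climb", "imm", "swim", "swim", "imm"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```

On this made-up example, 8 of 10 bins agree (P_o = 0.80) and both raters' marginal frequencies are 60/30/10%, so P_e = 0.46 and kappa ≈ 0.63, "substantial" on the Landis and Koch scale. Recomputing at a different bin width will generally shift the value, which is why the bin width should be reported alongside kappa.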

Frequently Asked Questions