FST/TST Live Scoring Pad.

Score forced swim test or tail suspension test behavior in real time with keyboard shortcuts. Get immobility %, latency, bout analysis, time-bin breakdown, group comparisons, and inter-rater reliability.

Private: Data stays in your browser
Live: No sign-up required
Validated: 2026-04-05
Citable: Methods and citation included

Behaviors
I: Immobility, S: Swimming, C: Climbing

When to use

  • Score Forced Swim Test trials in real time using keyboard shortcuts for immobility, swimming, and climbing behaviors
  • Score Tail Suspension Test trials in real time using keyboard shortcuts for immobility and active struggling
  • Validate automated video scoring systems (ANY-maze, EthoVision, ConductVision) by comparing against manual scores
  • Assess inter-rater reliability between two scorers using Cohen's kappa on time-bin data
  • Analyze pre-scored behavioral event data by importing timestamped logs for time-bin and group-level statistics
  • Compare treatment groups by exporting scored data for downstream statistical analysis (t-tests, ANOVA)

Do not use for

  • Automated video-based behavior classification — use ConductVision, which classifies immobility, swimming, and climbing automatically from video recordings without manual scoring
  • Pre-scored immobility totals that only need group statistics — use the FST Immobility Calculator or TST Immobility Calculator for batch analysis of already-quantified data
  • Non-despair behavioral assays (open field, elevated plus maze, Morris water maze) — this tool's behavioral categories and scoring logic are specific to the FST and TST

Blind the scorer to treatment group

Expectation bias is the most common threat to validity in manual behavioral scoring. The scorer should never know which treatment group the animal belongs to during scoring. Assign animals coded IDs and decode group assignments only after all scoring is complete. If scoring live during the experiment, ensure the scorer is not the same person who administered the drug.
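One way to implement coded IDs is to generate a random code for each animal and keep the decoding key away from the scorer until scoring is finished. The following is a minimal sketch under stated assumptions: the function name `make_blinding_key`, the code format, and the example cohort are all hypothetical and not part of this tool.

```python
import random

def make_blinding_key(animals, seed=None):
    """Assign each (animal_id, group) pair a random, group-agnostic code.

    Returns (scoring_order, key). The scorer sees only scoring_order;
    the key is held by someone else until all scoring is complete.
    """
    rng = random.Random(seed)
    codes = [f"A{n:03d}" for n in range(1, len(animals) + 1)]
    rng.shuffle(codes)  # break any link between code and cohort order
    key = {code: animal for code, animal in zip(codes, animals)}
    scoring_order = sorted(key)  # scorer works through the codes in sorted order
    return scoring_order, key

# Hypothetical cohort: (animal_id, treatment_group)
animals = [("m1", "vehicle"), ("m2", "drug"), ("m3", "vehicle"), ("m4", "drug")]
scoring_order, key = make_blinding_key(animals, seed=42)
print(scoring_order)  # codes only, no group information
```

Storing `key` in a separate file that the scorer cannot open keeps the decoding step auditable.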

Define immobility operationally before scoring begins

The boundary between immobility and minimal movement is the primary source of inter-rater disagreement. Write an explicit operational definition (e.g., "absence of all movement except minor hind-limb paddling necessary to keep the nose above water") and train all scorers on reference videos until Cohen's kappa exceeds 0.80 before scoring experimental data. Document the definition in your methods section.

Use consistent time-bin widths across studies

Time-bin width affects both the granularity of your time-course data and the Cohen's kappa value. Narrower bins (1 second) capture behavioral transitions precisely but amplify temporal alignment noise between scorers. Wider bins (60 seconds) smooth over brief behaviors. A 5-second bin width is a common compromise for FST/TST inter-rater analyses. Whatever width you choose, apply it consistently across all animals and experiments in a study.
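The effect of bin width can be illustrated with a short sketch that splits a scored trial into equal bins and takes the majority behavior in each one. It assumes bouts are stored as gapless `(start_s, end_s, behavior)` tuples; the function names and the example bouts are hypothetical, and this is not necessarily how the tool computes its bins internally.

```python
import math

def bin_proportions(bouts, trial_s, bin_s):
    """Return one dict per bin mapping behavior -> fraction of that bin."""
    bins = []
    for i in range(math.ceil(trial_s / bin_s)):
        lo, hi = i * bin_s, min((i + 1) * bin_s, trial_s)
        shares = {}
        for start, end, behavior in bouts:
            overlap = min(end, hi) - max(start, lo)
            if overlap > 0:
                shares[behavior] = shares.get(behavior, 0.0) + overlap / (hi - lo)
        bins.append(shares)
    return bins

def majority_labels(bins):
    """Pick the behavior with the largest share in each bin (ties: first seen)."""
    return [max(shares, key=shares.get) if shares else None for shares in bins]

# Hypothetical 20-second excerpt
bouts = [(0.0, 7.0, "swimming"), (7.0, 16.0, "immobility"), (16.0, 20.0, "climbing")]
narrow = majority_labels(bin_proportions(bouts, trial_s=20, bin_s=5))
wide = majority_labels(bin_proportions(bouts, trial_s=20, bin_s=10))
print(narrow)
print(wide)
```

With 5-second bins the 4-second climbing bout is the majority behavior in the last bin; with 10-second bins it is outvoted by immobility and drops out of the majority labels, which is the smoothing effect described above.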

Score from the correct trial segment

In the rat FST with a pretest, the test session is typically a 5-minute swim run 24 hours after the 15-minute pretest, and the full 5 minutes are scored. In the mouse FST without a pretest, the session usually lasts 6 minutes, and many protocols score only the last 4 minutes, treating the first 2 minutes as habituation. In the TST, the standard is 6 minutes from the moment the animal is fully suspended. Ensure your scoring window matches the protocol described in your methods, and start the timer at the correct event.
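Restricting analysis to a scoring window amounts to clipping each bout to the window before summing. The sketch below assumes bouts stored as `(start_s, end_s, behavior)` tuples and uses an illustrative window of 120 to 360 seconds; the function name, the window, and the data are hypothetical, so substitute the values your protocol specifies.

```python
def immobility_in_window(bouts, window_start, window_end, target="immobility"):
    """Return (seconds, percent) of the scoring window spent in `target`."""
    total = 0.0
    for start, end, behavior in bouts:
        if behavior == target:
            # Clip the bout to the window; bouts outside the window contribute 0
            total += max(0.0, min(end, window_end) - max(start, window_start))
    return total, 100.0 * total / (window_end - window_start)

# Hypothetical 360-second session
bouts = [
    (0, 100, "swimming"),
    (100, 150, "immobility"),
    (150, 240, "swimming"),
    (240, 360, "immobility"),
]
seconds, percent = immobility_in_window(bouts, 120, 360)
full_seconds, full_percent = immobility_in_window(bouts, 0, 360)
print(seconds, percent)
print(full_seconds, round(full_percent, 1))
```

In this example the windowed immobility is 62.5% while the full-session value is about 47.2%, so the same trial gives noticeably different numbers depending on which segment is scored.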

Do not confuse behavioral despair with learned helplessness

The FST and TST measure behavioral despair (acute coping strategy in an inescapable situation), not learned helplessness (a distinct paradigm involving prior exposure to uncontrollable stress followed by escape testing). The terms are sometimes used interchangeably in the literature, but they describe different constructs with different neurobiological substrates. Use "immobility" or "behavioral despair" rather than "depression" when describing FST/TST outcomes, as immobility reflects a coping strategy rather than a mood state.

Method

The scoring pad uses real-time keyboard event capture to record behavioral state transitions with millisecond resolution. Behavioral categories are mutually exclusive: pressing a behavior key ends the previous behavior and begins the new one, producing a continuous gapless record. Trial timing is managed by a high-resolution browser timer (performance.now()). Time-bin analysis divides the trial into equal-width intervals and computes the proportion of each behavior within each bin. Inter-rater reliability is assessed using Cohen's kappa, computed on a per-bin basis by comparing two independent scorers' majority-behavior assignments for each bin. The tool supports configurable trial duration, time-bin width, and behavioral category labels for both FST (immobility, swimming, climbing) and TST (immobility, struggling) protocols. All computation is performed client-side — no data leaves your browser.
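The transition-based record described above can be sketched as follows: each key press closes the previous behavior and opens the next, and summary measures are derived from the resulting bouts. This is an illustration only, not the tool's source code. It assumes timestamps in seconds with the first key press at trial start, and the names `events_to_bouts` and `summarize` are hypothetical.

```python
def events_to_bouts(events, trial_s):
    """Convert ordered (timestamp_s, behavior) key presses into gapless bouts."""
    bouts = []
    boundaries = [t for t, _ in events[1:]] + [trial_s]
    for (start, behavior), end in zip(events, boundaries):
        end = min(end, trial_s)
        if end <= start:
            continue  # press at or after trial end, or a zero-length state
        if bouts and bouts[-1][2] == behavior:
            bouts[-1] = (bouts[-1][0], end, behavior)  # repeated key extends the bout
        else:
            bouts.append((start, end, behavior))
    return bouts

def summarize(bouts, trial_s, target="immobility"):
    """Percent of trial, latency to first bout, bout count, and mean bout length."""
    target_bouts = [(s, e) for s, e, b in bouts if b == target]
    total = sum(e - s for s, e in target_bouts)
    return {
        "percent": 100.0 * total / trial_s,
        "latency_s": target_bouts[0][0] if target_bouts else None,
        "n_bouts": len(target_bouts),
        "mean_bout_s": total / len(target_bouts) if target_bouts else 0.0,
    }

# Hypothetical key-press log for a 120-second excerpt
events = [(0.0, "swimming"), (42.5, "immobility"), (60.0, "climbing"), (75.0, "immobility")]
bouts = events_to_bouts(events, trial_s=120.0)
stats = summarize(bouts, trial_s=120.0)
print(bouts)
print(stats)
```

Because every bout ends exactly where the next begins, the bout durations sum to the trial length, which is what makes the per-behavior percentages add up to 100.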

Validated

Last validated 2026-04-05. Calculations are designed for planning and documentation support; verify procurement decisions against manufacturer specifications or institutional SOPs.

How to cite

ConductScience FST/TST Live Scoring Pad (v1.0). ConductScience, Inc. 2026. Available at: https://conductscience.com/tools/fst-tst-live-scoring-pad

This tool facilitates manual behavioral scoring and computes descriptive statistics and inter-rater reliability metrics. It does not perform inferential statistical tests (e.g., t-tests, ANOVA) on group comparisons. Scored data should be exported and analyzed using appropriate statistical software. Behavioral scoring is inherently subjective; users should establish and report operational definitions and inter-rater reliability for their specific study.

Manual Behavioral Scoring: Principles and Practice

Manual behavioral scoring remains the gold standard for validation in preclinical behavioral neuroscience, despite the availability of automated tracking systems. The fundamental principle is that a trained human observer classifies an animal's behavior into mutually exclusive categories at each moment in time. For the Forced Swim Test, the standard categories are immobility (floating with only minor movements to maintain head position), swimming (horizontal movement through the water, typically associated with serotonergic mechanisms), and climbing (vigorous upward-directed movements against the cylinder wall, associated with noradrenergic mechanisms). For the Tail Suspension Test, the primary distinction is between immobility (passive hanging without limb or body movement) and active struggling (curling, swinging, or limb movement). The reliability of manual scoring depends on three factors: clear operational definitions that minimize ambiguity at behavioral boundaries, consistent training across all scorers using reference videos with known scores, and blinding of the scorer to treatment group assignment to prevent expectation bias. Scoring can be performed live during the test session or from video recordings. Live scoring has the advantage of real-time data availability and avoids the storage and review time required for video, but it does not allow the scorer to pause, rewind, or review ambiguous moments. Video-based scoring allows frame-by-frame analysis but introduces the risk of altered perception at different playback speeds. For both approaches, the scorer should maintain a consistent temporal sampling strategy — either continuous real-time coding (as this tool provides) or instantaneous sampling at fixed intervals (e.g., scoring behavior every 5 seconds). Continuous coding captures the exact duration of each behavior and is more sensitive to brief behavioral changes, while instantaneous sampling is simpler but can miss short-duration events. 
Whichever method is chosen, it must be reported in the methods section and applied consistently across all animals in the study.
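The difference between the two sampling strategies can be made concrete with a small comparison. The sketch assumes bouts stored as `(start_s, end_s, behavior)` tuples and samples at the start of each interval; the function names and the example trial are hypothetical.

```python
def behavior_at(bouts, t):
    """Return the behavior in progress at time t, or None if t is outside all bouts."""
    for start, end, behavior in bouts:
        if start <= t < end:
            return behavior
    return None

def sampled_percent(bouts, trial_s, interval_s, target="immobility"):
    """Instantaneous sampling: check the behavior once per interval."""
    times = [i * interval_s for i in range(int(trial_s // interval_s))]
    hits = sum(behavior_at(bouts, t) == target for t in times)
    return 100.0 * hits / len(times)

def continuous_percent(bouts, trial_s, target="immobility"):
    """Continuous coding: sum the exact duration of every target bout."""
    return 100.0 * sum(e - s for s, e, b in bouts if b == target) / trial_s

# Hypothetical 60-second excerpt containing one brief (3 s) immobility bout
bouts = [(0, 11, "swimming"), (11, 14, "immobility"), (14, 30, "swimming"), (30, 60, "immobility")]
sampled = sampled_percent(bouts, trial_s=60, interval_s=5)
continuous = continuous_percent(bouts, trial_s=60)
print(sampled, continuous)
```

Here the 3-second bout falls between the sample points at 10 and 15 seconds, so 5-second sampling reports 50% immobility while continuous coding reports 55%.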

FST vs TST: When to Use Each Assay

The Forced Swim Test and Tail Suspension Test are the two most widely used acute assays for assessing despair-like behavior in rodents, and choosing between them depends on species, research question, and practical constraints. The FST was introduced by Porsolt, Le Pichon, and Jalfre in 1977 using rats, and was later adapted for mice. It involves placing the animal in a cylinder of water (25 ± 1 °C) from which it cannot escape, and measuring immobility during a test session (typically 5-6 minutes), often preceded by a 15-minute pretest swim 24 hours earlier in rats (the pretest is usually omitted in mice). The FST is applicable to both rats and mice and has the advantage of allowing differentiation between swimming (serotonergic) and climbing (noradrenergic/catecholaminergic) active behaviors, as demonstrated by Detke, Rickels, and Lucki (1995). However, the FST involves water immersion, which introduces the confounds of hypothermia, wet fur affecting subsequent testing, and the physical stress of swimming. The TST was developed by Steru, Chermat, Thierry, and Simon in 1985 as an alternative that avoids these water-related confounds. Mice are suspended by the tail using adhesive tape, and immobility is measured over a 6-minute trial with no pretest required. The TST is faster to set up, avoids hypothermia, and produces highly reproducible results in mice. However, it is generally limited to mice (rats are too heavy for reliable suspension), cannot distinguish between swimming and climbing (since the animal is in air, active behaviors are classified simply as struggling versus immobility), and some mouse strains (particularly C57BL/6J) exhibit tail-climbing behavior that complicates scoring.
Both assays detect the effects of acute antidepressant treatment (SSRIs, SNRIs, tricyclics) with high sensitivity, but they do not always agree: a compound may reduce immobility in the FST but not the TST, or vice versa, reflecting the fact that they engage partially different neurobiological substrates. Best practice in antidepressant screening is to use both assays when feasible, and to report which specific behaviors (immobility, swimming, climbing, struggling) were scored, how immobility was operationally defined, and the temporal resolution of scoring.

Inter-Rater Reliability and Cohen's Kappa

Inter-rater reliability (IRR) is the statistical quantification of agreement between two or more independent observers scoring the same behavioral event. In the context of FST and TST scoring, IRR is critical because the boundary between immobility and low-level activity involves subjective judgment that can vary between scorers, between laboratories, and even within the same scorer over time (intra-rater drift). The most widely used metric for two-rater categorical agreement is Cohen's kappa (1960), which adjusts for the proportion of agreement that would be expected by chance alone. The formula is kappa = (P_o - P_e) / (1 - P_e), where P_o is the observed proportion of agreement (the fraction of time-bins where both scorers assigned the same behavioral category) and P_e is the expected proportion of agreement under the null hypothesis that the two scorers assign categories independently based on their individual marginal distributions. Kappa ranges from -1 (perfect disagreement) through 0 (chance-level agreement) to +1 (perfect agreement). The standard interpretation scale (Landis and Koch, 1977) classifies kappa values as: below 0.00, poor; 0.00-0.20, slight; 0.21-0.40, fair; 0.41-0.60, moderate; 0.61-0.80, substantial; 0.81-1.00, almost perfect. For behavioral scoring in published FST and TST studies, a kappa of 0.80 or above is generally expected for the immobility category, and many journals and reviewers will question results scored with lower reliability. 
To maximize IRR, laboratories should: (1) develop written operational definitions with illustrative video examples for each behavioral category; (2) train all scorers on a common set of reference videos until they independently achieve kappa above 0.80; (3) have a subset of experimental trials (typically 10-20%) scored independently by two raters and report the kappa value in the methods section; (4) monitor for intra-rater drift by re-scoring a subset of trials at the end of the study and comparing to initial scores. This tool computes Cohen's kappa automatically when two scorers' data are loaded for the same trial, using time-bin-level agreement. The time-bin width affects the kappa value: narrower bins (e.g., 1 second) are more sensitive to temporal alignment differences between scorers, while wider bins (e.g., 5 seconds) may obscure genuine disagreements. A bin width of 1-5 seconds is typical for FST/TST inter-rater analyses.
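The kappa formula given above can be checked by hand with a short sketch. It assumes each scorer's data has already been reduced to one category label per time bin; the function name `cohens_kappa` and the example labels are hypothetical, and the tool's own implementation may differ in details such as tie handling.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' per-bin category labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(labels_a)
    # P_o: fraction of bins where both raters chose the same category
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # P_e: chance agreement from each rater's marginal category frequencies
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")  # both raters used one identical category; kappa is undefined
    return (p_o - p_e) / (1 - p_e)

# Hypothetical majority labels for ten bins (I = immobility, S = swimming, C = climbing)
rater_a = ["I", "I", "S", "S", "C", "I", "I", "S", "I", "I"]
rater_b = ["I", "I", "S", "I", "C", "I", "S", "S", "I", "I"]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 3))
```

In this example the raters agree on 8 of 10 bins (P_o = 0.80) and chance agreement is P_e = 0.46, giving a kappa of about 0.63, which falls in the "substantial" band and below the 0.80 threshold discussed above.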
