Medical Data: The Core of Research

Analyzing data is one of the most important and exciting parts of any health-related study or clinical trial. Ultimately, only data can answer a study’s research questions and test its hypotheses. Like every other aspect of research, data analysis requires careful planning. Good documentation is a must: every step of the analysis should be recorded, and good data management practices should produce clear records and visual representations. A logbook is a useful tool for supporting data analysis and documentation practices.

Most of all, data should be objective, clear, and truthful (Peat, 2011). Let’s not forget that it’s unethical to adjust information and datasets only to obtain statistically significant results that are not clinically important. In the end, patients’ well-being and quality of life come before numbers and reports.


Data Analysis: Planning the Analysis

Before researchers start the data analysis, there are a few recommendations to consider. The first step is to perform a univariate analysis; only after that should bivariate and multivariate analyses be conducted. This gives researchers a chance to analyze each variable in detail (Peat, 2011). It also gives data analysts meaningful insights into the type of each variable, the range of values, possible errors, and skewness.
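The univariate step can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a prescribed procedure: the function name describe, the blood pressure readings, and the choice of a moment-based skewness formula are all assumptions made for the example.

```python
import statistics

def describe(values):
    """Summarize one continuous variable before any bivariate analysis."""
    n = len(values)
    mean = statistics.mean(values)
    # Moment-based skewness (population form); a value above 0 suggests a right tail.
    # Assumes the values are not all identical (otherwise the SD is zero).
    skew = sum((x - mean) ** 3 for x in values) / (n * statistics.pstdev(values) ** 3)
    return {
        "n": n,
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "median": statistics.median(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "skewness": skew,
    }

# Hypothetical systolic blood pressure readings (mmHg); 410 looks like a data-entry error
sbp = [118, 122, 125, 130, 121, 119, 127, 410]
summary = describe(sbp)
print(summary)
```

Here the maximum and the positive skewness both point to the implausible reading, which is the kind of possible error a univariate check is meant to surface.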

It’s important to mention that when data are not normally distributed, experts should employ non-parametric statistics or transform the data. Again, all steps must be entered into the logbook of the study.
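As an illustration of transformation, the sketch below applies a natural-log transform to a right-skewed variable and compares the skewness before and after. The log transform is only one common option and is an assumption here, as are the helper name skewness and the length-of-stay values.

```python
import math
import statistics

def skewness(values):
    """Moment-based skewness (population form); assumes non-constant data."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / (n * sd ** 3)

# Hypothetical length-of-stay values in days (all positive, long right tail)
los = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 14, 30]

# The natural log is defined only for positive values
log_los = [math.log(x) for x in los]

raw_skew = skewness(los)
log_skew = skewness(log_los)
print(f"skewness before: {raw_skew:.2f}, after log transform: {log_skew:.2f}")
```

If the transformed variable is still clearly skewed, non-parametric methods remain the alternative described above.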


Creating the Right Questions

Before proceeding with the analysis, researchers must have clear ideas and methods for dealing with missing data. When data are missing at random, the statistical power of the study may be affected, but not the actual results. For instance, it’s common for participants to skip an item by accident, especially when the visual presentation of the questionnaire is not clear (Peat, 2011). Therefore, surveys and apps should be simple, clear, and user-friendly. If possible, experts can contact those subjects and ask them to supply any missing information.

When data are missing non-randomly, though, the generalizability of the study may be affected. For example, people often avoid revealing information about their economic status, which may affect the results. A helpful tip is to include missing data in the prevalence rates but not in the bivariate or multivariate analyses. For continuous data, on the other hand, a mean value can be substituted for the missing values. Note that in this case, using the mean of the total sample is considered a conservative approach, so it’s better to use subgroup mean values.
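Subgroup mean substitution can be sketched as follows. This is a minimal example under stated assumptions: the function name impute_subgroup_mean, the record layout, and the BMI values are hypothetical, and missing values are assumed to be stored as None.

```python
import statistics

def impute_subgroup_mean(records, group_key, value_key):
    """Return copies of the records with missing (None) values replaced
    by the mean of the observed values in the same subgroup."""
    observed = {}
    for r in records:
        if r[value_key] is not None:
            observed.setdefault(r[group_key], []).append(r[value_key])
    group_means = {g: statistics.mean(v) for g, v in observed.items()}

    completed = []
    for r in records:
        filled = dict(r)  # leave the original record untouched
        filled["imputed"] = filled[value_key] is None  # flag for the logbook
        if filled["imputed"]:
            # Raises KeyError if a subgroup has no observed values at all
            filled[value_key] = group_means[filled[group_key]]
        completed.append(filled)
    return completed

# Hypothetical records: BMI is missing for one participant in each subgroup
records = [
    {"sex": "F", "bmi": 22.0}, {"sex": "F", "bmi": 24.0}, {"sex": "F", "bmi": None},
    {"sex": "M", "bmi": 27.0}, {"sex": "M", "bmi": 29.0}, {"sex": "M", "bmi": None},
]
completed = impute_subgroup_mean(records, "sex", "bmi")
```

Keeping an explicit flag for each substituted value makes it easy to record the step in the study logbook and to exclude those values later if needed.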


Analyzing Outliers

Outliers are often described as incorrect values or anomalies: figures that lie far outside the norm. When the research sample is small, outliers can lead to type I and type II errors. That means that if outliers are included in the analysis, generalizability may be affected.

To deal with abnormalities, researchers can simply delete any outlying values (Peat, 2011). Another approach is to recode the outliers and replace them with values that are closer to the mean. All steps should be documented.
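Both approaches can be sketched briefly. The example assumes that the researchers have already agreed on a plausible range for the variable; the limits of 70 and 250 mmHg, the function names, and the readings are hypothetical, and recoding to the nearest limit is just one way of moving a value closer to the mean.

```python
def drop_outliers(values, lower, upper):
    """Approach 1: delete values outside the plausible range."""
    return [x for x in values if lower <= x <= upper]

def recode_outliers(values, lower, upper):
    """Approach 2: recode outlying values to the nearest plausible limit.

    Returns the recoded list and a log of (index, old value, new value)
    entries that can be copied into the study logbook.
    """
    recoded, log = [], []
    for i, x in enumerate(values):
        new = min(max(x, lower), upper)
        if new != x:
            log.append((i, x, new))
        recoded.append(new)
    return recoded, log

# Hypothetical systolic blood pressure readings (mmHg)
sbp = [118, 122, 125, 130, 121, 119, 127, 410]

kept = drop_outliers(sbp, lower=70, upper=250)
recoded, change_log = recode_outliers(sbp, lower=70, upper=250)
```

Deleting shrinks the sample, while recoding keeps the sample size but changes a value, so the choice and the limits used belong in the documentation.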


Categorizing Variables

Defining and categorizing variables are also crucial aspects of the data analysis. Not surprisingly, before any bivariate or multivariate analyses, all variables should be categorized.

Note that outcome variables are the dependent variables, which are placed on the y-axis. Intervening variables, such as secondary and alternative outcomes, also go on the y-axis. Explanatory variables, on the other hand, also called independent variables, risk factors, exposures, or predictors, should be plotted on the x-axis (Peat, 2011).


Data Documentation & Research

Data documentation is paramount. As explained above, all steps of the data analysis, from recoding outliers to performing univariate analyses, should be documented in a data management file (Peat, 2011). Aspects such as structure, location, coding, and missing values must be documented and kept safely.

Although digital solutions support clinical trials, print-outs should also be stored and secured. Files should be kept together and labeled accordingly. In the end, documentation should ensure transparency and interoperability.


Interim Analyses

Dealing with data is tricky. Interim analyses are crucial in research as they can support good ethical principles and management practices. However, experts should try to minimize the number of interim analyses. The most desirable option is to have the dataset completed before the actual start of the analysis.

Note that medical data is sensitive. Therefore, it should be used only for the purposes it was collected for and for the hypotheses that were formulated prior to the study. Otherwise, data can be misused, a practice known as data dredging or a fishing expedition (Peat, 2011). Data dredging means searching large datasets for any relationship that reaches statistical significance, which tends to turn up relationships that don’t really exist. Cross-sectional and case-control studies are sometimes prone to such practices.
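A short calculation shows why unplanned testing produces spurious findings. The sketch assumes that the tests are independent and that each uses a significance level of 0.05; the function name familywise_error_rate is hypothetical. Under those assumptions, the chance of at least one false positive across k tests is 1 - (1 - alpha)^k.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across n_tests
    independent tests when no true relationship exists."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 20, 50):
    print(f"{k:>2} tests: {familywise_error_rate(k):.2f}")
```

With 20 unplanned comparisons, the chance of at least one spurious "significant" result is already well above one half, which is why hypotheses should be fixed before the data are examined.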

Still, some datasets can be explored a step further. If a new study has developed an appropriate study design, an existing set can be used more than once to explore new relationships. In fact, in high-quality data, results that were not anticipated would not interfere with the study. Note that research hypotheses should have biological plausibility, which means that a cause-and-effect relationship between a factor and a disease must be credible in biological terms.


Data Analysis: The Methods

After all the corrections have been made, researchers can start the actual statistical analysis. The results describe the sample, which is assumed to represent the population. In practice, however, samples always differ somewhat from the population, so the results would vary if the random sampling were repeated. This phenomenon is known as sampling variability. It cannot be removed entirely, but researchers try to minimize it, for example by using larger samples.
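Sampling variability is easy to see in a small simulation. The sketch below draws repeated random samples from a simulated population and measures how much the sample means differ; the population parameters, sample sizes, seeds, and the function name sample_means are all assumptions chosen for illustration.

```python
import random
import statistics

def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (without replacement within each
    sample) and return the mean of each one."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(n_samples)
    ]

# Simulated population: 10,000 hypothetical blood pressure values
rng = random.Random(42)
population = [rng.gauss(120, 15) for _ in range(10_000)]

small = sample_means(population, sample_size=10, n_samples=200)
large = sample_means(population, sample_size=100, n_samples=200)

# Spread of the sample means: smaller samples vary more from draw to draw
spread_small = statistics.stdev(small)
spread_large = statistics.stdev(large)
print(f"n=10: {spread_small:.2f}, n=100: {spread_large:.2f}")
```

The means of the larger samples cluster more tightly, which illustrates that sample size reduces sampling variability without ever removing it.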


Univariate Methods

Univariate methods can help experts explore each variable. The frequencies of categorical data and the distribution of continuous variables should be calculated in order to gain meaningful results. Frequency tables can help researchers decide whether categories and groups should be combined and whether procedures such as Pearson’s chi-square or Fisher’s exact test should be conducted. Note that for categorical data with ordered categories, non-parametric statistics can be employed.

Since categories with small numbers can affect the results significantly, groups can be combined. This can be done by analyzing the distribution of each variable (Peat, 2011), which, as mentioned above, should be done before the start of any bivariate or multivariate analyses.
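Combining sparse categories can be sketched with a frequency table. The cut-off of five observations, the label "Other", the function name, and the smoking data are assumptions for illustration; which categories may sensibly be merged is a judgment that depends on the study.

```python
from collections import Counter

def combine_small_categories(values, min_count=5, other_label="Other"):
    """Merge every category with fewer than min_count observations
    into a single combined category."""
    counts = Counter(values)
    small = {category for category, n in counts.items() if n < min_count}
    return [other_label if v in small else v for v in values]

# Hypothetical smoking-status variable with two sparse categories
smoking = ["never"] * 12 + ["former"] * 7 + ["current"] * 4 + ["pipe"] * 1

before = Counter(smoking)
after = Counter(combine_small_categories(smoking))
print(before)
print(after)
```

The frequency table before and after the merge is worth recording, since the recoding changes how the variable enters later analyses.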


Continuous Data

Although the analysis of categorical data is pretty straightforward, continuous and discrete data also offer numerous insights. For continuous data, experts should check whether the values are normally distributed or skewed. If they are skewed, transformation of the data or non-parametric methods should be considered (Peat, 2011). Note, however, that when their assumptions are met, parametric methods generally offer more statistical power than non-parametric methods for the same sample size.

Apart from following the pathway above, experts should calculate basic summary statistics, such as the distribution, mean, standard deviation, median, and range of each variable (Peat, 2011). The mean is the average value of the data. The median is the central point of the data: half of the measurements lie below it and half above it. The range, on the other hand, is the spread from the lowest to the highest value.

When it comes to continuous data, experts should understand that the median and the mean are identical when the distribution is normal, and different when the distribution is skewed. If the data are skewed to the right, the mean will be an over-estimate of the median value. On the other hand, if the data are skewed to the left, the mean will be an under-estimate of the median value. Note that in case the mean and median values cannot be obtained, there are other formulas that experts can employ (often by calculating the 95% range of the values).
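The relationship between the mean and the median can be checked on three small made-up datasets; the values below are hypothetical and chosen only to make the direction of the difference visible.

```python
import statistics

# Hypothetical datasets: symmetric, skewed to the right, skewed to the left
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 20]
left_skewed = [1, 17, 17, 18, 18, 18, 19, 19, 20]

for name, data in [
    ("symmetric", symmetric),
    ("right-skewed", right_skewed),
    ("left-skewed", left_skewed),
]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    print(f"{name:>12}: mean = {mean:.2f}, median = {median:.2f}")
```

In the right-skewed set the single large value pulls the mean above the median, and in the left-skewed set the single small value pulls it below.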


Confidence Intervals

Confidence intervals are paramount values in research and data analysis. Just like with the mean and the median values, confidence intervals can help experts and statisticians make sense of their datasets. Confidence intervals indicate a range of values within which the true summary statistic can be found (Streiner, 1996).

Interestingly, the 95% confidence interval is defined as an estimate of the range in which there is a 95% chance that the true value lies (Peat, 2011). While the 95% confidence interval measures precision, it’s important to remember that it differs from the interval defined by the mean +/-2 standard deviations (SDs). To be more precise, the SD indicates the variability of the original data points. The confidence intervals, on the other hand, are constructed based on the standard error of the mean, which is the variability of the mean values. Both values can help experts analyze their datasets in depth.
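The difference between the two intervals can be made concrete. The sketch below is a simplification under stated assumptions: it uses the normal critical value (about 1.96) for the 95% confidence interval, whereas small samples would normally call for a slightly wider t-based interval, and the function name and HDL cholesterol values are hypothetical.

```python
import math
import statistics

def sd_and_ci95(values):
    """Return the mean, SD, standard error of the mean, the mean +/- 2 SD
    range, and an approximate 95% confidence interval for the mean."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)       # variability of the data points
    sem = sd / math.sqrt(n)             # variability of the mean
    z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 (normal approximation)
    return {
        "mean": mean,
        "sd": sd,
        "sem": sem,
        "sd_range": (mean - 2 * sd, mean + 2 * sd),
        "ci95": (mean - z * sem, mean + z * sem),
    }

# Hypothetical HDL cholesterol values (mg/dL) for 16 participants
hdl = [50, 54, 47, 61, 58, 49, 53, 56, 52, 60, 45, 57, 55, 51, 59, 48]
result = sd_and_ci95(hdl)
print(result)
```

The mean +/- 2 SD range describes where individual measurements fall, while the much narrower confidence interval describes how precisely the mean itself has been estimated.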


Baseline Comparisons

Baseline comparisons are vital. One of the first steps experts need to take is to compare key characteristics, such as confounders and other factors that may affect the results. Randomized trials usually ensure transparency and balance of confounders. However, some researchers may perform a significance test in order to compare a baseline with a final measurement in each separate group (Bland & Altman, 2011). Note that mean values and SDs, rather than the standard error or the 95% confidence interval, are usually needed to report baseline characteristics of continuously distributed data.

Most of all, experts should understand that sometimes statistical tests are not enough, and the absolute differences between the subjects are better indicators of any possible clinical differences between groups (Peat, 2011). In the end, baseline differences should be adjusted for in bivariate and multivariate analyses, such as multiple regression.


Bivariate and Multivariate Methods

After completing the fundamental initial steps, such as univariate analyses, distribution of variables, categorization of variables, and investigation of baseline characteristics, it’s time to continue further. Although bivariate and multivariate methods may sound more complicated, there are clear formulas that experts can use.

Note that depending on the number of outcomes, there are different statistical techniques (Peat, 2011). Some of the main procedures for studies with one outcome are McNemar’s test, Kappa, Friedman’s analysis of variance, paired t-test, and intra-class correlation. For two outcomes, on the other hand, statistics such as the likelihood ratio, Kendall’s correlation, Wilcoxon test, Mann-Whitney test, and logistic regression can be employed. Canonical correlation and adjusted odds ratios can be utilized for more than two outcomes. Note that multiple regression and logistic regression are two of the most popular techniques employed in multivariate analysis.
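As one example from the list above, the paired t-test statistic can be computed directly from the within-subject differences. This is a minimal sketch: the function name and the before and after measurements are hypothetical, and only the test statistic and degrees of freedom are computed, since turning them into a p-value requires the t distribution from a statistics package or table.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return the paired t statistic and its degrees of freedom.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the paired differences.
    """
    if len(before) != len(after):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se_diff = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff / se_diff, n - 1

# Hypothetical systolic blood pressure (mmHg) for six patients, before and after treatment
before = [140, 152, 138, 145, 160, 155]
after = [135, 150, 136, 140, 150, 151]

t_stat, df = paired_t_statistic(before, after)
print(f"t = {t_stat:.2f} with {df} degrees of freedom")
```

The test works on the differences within each pair, which is why it suits repeated measurements on the same subjects rather than comparisons between independent groups.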


Visual Representation

To sum up, dealing with data may seem complicated. Yet, it’s one of the most exciting aspects of research. There are numerous variables, categories, types of analyses, and statistical methods to help researchers analyze medical information. Most of all, to make sense out of this maze of figures and values, visual representations are crucial.

Experts should always provide tables, graphs, and charts to present their findings and ensure transparency. In the end, it’s not a secret that good documentation practices support scientific research. In practice, the visual representation can enhance communication between doctors and patients and boost the accuracy of diagnostic inferences (Garcia-Retamero & Hoffrage, 2013).

After all, data is not only an abstract notion: it should be utilized to support people’s well-being and scientific progress.



Bland, J., & Altman, D. (2011). Comparisons against baseline within randomised groups are often used and can be highly misleading. Trials, 12.

Garcia-Retamero, R., & Hoffrage, U. (2013). Visual representation of statistical information improves diagnostic inferences in doctors and their patients. Social Science and Medicine, 83, 27-33.

Peat, J. (2011). Analysing the data. Health Science Research: SAGE Publications, Ltd.

Streiner, D. (1996). Maintaining standards: differences between the standard deviation and standard error, and when to use each. The Canadian Journal of Psychiatry, 41(8), 498-502.

Why Is Continuous Data “Better” than Categorical or Discrete Data? (2017, April 7). Retrieved from