Beyond Course Averages: A Generalized Bayesian Hierarchical Framework for Course-Level Learning Evaluation
DOI: https://doi.org/10.17309/jltm.2026.7.1.04
Keywords: Bayesian hierarchical modeling, multilevel analysis, course-level assessment, small sample instability, educational measurement
Abstract
Background. Course-level learning assessment in higher education is commonly based on comparisons of average performance indicators, implicitly assuming independence across courses and equal reliability of estimates. When enrollments are small and uneven, such approaches produce statistically unstable estimates and exaggerate extreme values, leading to potentially misleading interpretations.
Objectives. This study aims to develop a generalizable methodological framework for applying Bayesian hierarchical modeling (BHM) to course-level learning assessment, explicitly accounting for sampling uncertainty and unequal group sizes.
Materials and Methods. A Bayesian hierarchical model was specified in which student learning outcomes were modeled at the individual level while accounting for course membership. The model decomposes total variance into within-course and between-course components and estimates course-level effects using posterior distributions. Partial pooling was applied to stabilize estimates for courses with small enrollments. An empirical illustration was conducted using anonymized data from 279 students across 22 courses.
Results. Naïve comparisons based on course averages were found to systematically exaggerate extreme outcomes under small sample conditions, resulting in unstable and potentially misleading conclusions. The application of Bayesian hierarchical modeling substantially reduced artificial extremity while preserving statistically supported between-course differences. After pooling, most course effects were not distinguishable from the program average, while a limited number of courses showed consistent deviations.
Conclusions. Bayesian hierarchical modeling provides a statistically robust alternative to descriptive aggregation and course ranking. By incorporating uncertainty and stabilizing estimates, it enables more reliable interpretation of course-level performance and supports targeted, evidence-based academic evaluation.
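Illustrative sketch. The partial pooling described in the abstract can be illustrated with a simplified normal-normal calculation. This is not the authors' model: it treats the within-course variance (sigma2_within) and between-course variance (tau2_between) as known constants, whereas a full Bayesian hierarchical model would estimate them from posterior distributions. The function name, the variance values, and the three-course data set are all hypothetical and chosen only to show the mechanism.

```python
def partial_pool(course_means, course_sizes, sigma2_within, tau2_between):
    """Shrink raw course means toward a precision-weighted program mean.

    Assumed model (variances treated as known, which is a simplification):
        y_ij    ~ Normal(theta_j, sigma2_within)   # student i in course j
        theta_j ~ Normal(mu, tau2_between)         # course-level effect
    """
    # Each raw course mean varies around mu with variance tau^2 + sigma^2 / n_j,
    # so its weight in the program mean is the inverse of that quantity.
    weights = [1.0 / (tau2_between + sigma2_within / n) for n in course_sizes]
    mu = sum(w * m for w, m in zip(weights, course_means)) / sum(weights)

    pooled = []
    for m, n in zip(course_means, course_sizes):
        # Weight placed on the course's own data; approaches 1 as n grows,
        # so small courses are pulled more strongly toward mu.
        b = tau2_between / (tau2_between + sigma2_within / n)
        pooled.append(b * m + (1.0 - b) * mu)
    return mu, pooled


# Hypothetical example: two small courses with extreme averages, one large course.
means = [90.0, 75.0, 60.0]
sizes = [3, 40, 4]
mu, pooled = partial_pool(means, sizes, sigma2_within=100.0, tau2_between=25.0)

for raw, n, est in zip(means, sizes, pooled):
    print(f"n={n:>2}  raw mean={raw:5.1f}  partially pooled={est:5.1f}")
```

In this sketch the two small courses move much further toward the program mean than the 40-student course does, which is the stabilizing behavior the abstract attributes to partial pooling. Ranking the raw means alone would treat all three estimates as equally reliable.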
Copyright (c) 2026 Vicente Montano, Archie Reyes

This work is licensed under a Creative Commons Attribution 4.0 International License.
