Beyond Course Averages: A Generalized Bayesian Hierarchical Framework for Course-Level Learning Evaluation

Authors

Vicente Montano, Archie Reyes

DOI:

https://doi.org/10.17309/jltm.2026.7.1.04

Keywords:

Bayesian hierarchical modeling, multilevel analysis, course-level assessment, small-sample instability, educational measurement

Abstract

Background. Course-level learning assessment in higher education commonly relies on comparing average performance indicators, which implicitly assumes that courses are independent and that all estimates are equally reliable. When enrollments are small and uneven, these comparisons yield statistically unstable estimates and exaggerate extreme values, which can lead to misleading interpretations.

Objectives. This study aims to develop a generalizable methodological framework for applying Bayesian hierarchical modeling (BHM) to course-level learning assessment, explicitly accounting for sampling uncertainty and unequal group sizes.

Materials and Methods. A Bayesian hierarchical model was specified in which student learning outcomes were modeled at the individual level while accounting for course membership. The model decomposes total variance into within-course and between-course components and estimates course-level effects using posterior distributions. Partial pooling was applied to stabilize estimates for courses with small enrollments. An empirical illustration was conducted using anonymized data from 279 students across 22 courses.
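
For orientation, one common way to write a model of this kind is the normal random-intercept form below. This is an illustrative sketch, not the authors' stated specification: the abstract does not give the likelihood or priors, so the normal distributions and the symbols (y_ij for the outcome of student i in course j, mu for the program mean, u_j for the course effect, sigma^2 and tau^2 for the within-course and between-course variances, n_j for enrollment) are assumptions.

```latex
\begin{aligned}
y_{ij} &= \mu + u_j + \varepsilon_{ij}, \\
u_j &\sim \mathcal{N}(0, \tau^2), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2), \\
\operatorname{Var}(y_{ij}) &= \tau^2 + \sigma^2, \\
\mathbb{E}\!\left[\mu + u_j \mid y, \mu, \sigma^2, \tau^2\right]
  &= \lambda_j \,\bar{y}_j + (1 - \lambda_j)\,\mu,
  \qquad \lambda_j = \frac{\tau^2}{\tau^2 + \sigma^2 / n_j}.
\end{aligned}
```

Under these assumptions the weight \lambda_j approaches 1 for large courses and shrinks toward 0 as n_j gets small, which is the sense in which partial pooling stabilizes small-enrollment estimates.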

Results. Naïve comparisons based on course averages systematically exaggerated extreme outcomes under small-sample conditions, producing unstable and potentially misleading conclusions. Bayesian hierarchical modeling substantially reduced this artificial extremity while preserving statistically supported between-course differences. After pooling, most course effects were not distinguishable from the program average, while a limited number of courses showed consistent deviations.
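
The shrinkage behavior described above can be illustrated with a minimal Python sketch on simulated scores. It is not the authors' implementation or data: the function name `partial_pool`, the course labels, and the variance values are hypothetical, and the variances `sigma2` and `tau2` are fixed by hand here, whereas a full Bayesian fit would estimate them from posterior distributions.

```python
import random
import statistics


def partial_pool(course_scores, sigma2, tau2):
    """Shrink each raw course mean toward the grand mean.

    The weight w = tau2 / (tau2 + sigma2 / n) is close to 1 for large
    courses and smaller for small ones, so small courses are pulled
    harder toward the grand mean. sigma2 (within-course variance) and
    tau2 (between-course variance) are ASSUMED known in this sketch.
    """
    all_scores = [s for scores in course_scores.values() for s in scores]
    grand_mean = statistics.fmean(all_scores)
    pooled = {}
    for course, scores in course_scores.items():
        n = len(scores)
        raw_mean = statistics.fmean(scores)
        w = tau2 / (tau2 + sigma2 / n)
        pooled[course] = grand_mean + w * (raw_mean - grand_mean)
    return grand_mean, pooled


# Simulated example: every course has the same true mean (75), so any
# spread in the raw course averages is sampling noise only.
rng = random.Random(0)
sizes = {"A": 3, "B": 5, "C": 12, "D": 40}  # hypothetical enrollments
scores = {c: [rng.gauss(75.0, 10.0) for _ in range(n)] for c, n in sizes.items()}

grand, pooled = partial_pool(scores, sigma2=100.0, tau2=9.0)
for c in sizes:
    raw = statistics.fmean(scores[c])
    print(f"{c} (n={sizes[c]:2d}): raw={raw:6.2f}  pooled={pooled[c]:6.2f}")
print(f"grand mean = {grand:.2f}")
```

Because each pooled value lies between the raw course mean and the grand mean, the pooled estimates can never be more extreme than the raw averages, and the smallest courses move the most.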

Conclusions. Bayesian hierarchical modeling provides a statistically robust alternative to descriptive aggregation and course ranking. By incorporating uncertainty and stabilizing estimates, it enables more reliable interpretation of course-level performance and supports targeted, evidence-based academic evaluation.

Author Biographies

Vicente Montano, University of Mindanao

Business Economics Department, College of Business Administration Education, Bolton St., 8000, Davao City, Philippines

Archie Reyes, University of Mindanao

Human Resource Management Department, College of Business Administration Education, Bolton St., 8000, Davao City, Philippines

Published

2026-04-30

How to Cite

Montano, V., & Reyes, A. (2026). Beyond Course Averages: A Generalized Bayesian Hierarchical Framework for Course-Level Learning Evaluation. Journal of Learning Theory and Methodology, 7(1), 37–48. https://doi.org/10.17309/jltm.2026.7.1.04

Section

Original Scientific Articles
