Research Article

A data pipeline for e-large-scale assessments: Better automation, quality assurance, and efficiency

Year 2023, Volume: 10, Issue: Special Issue, 116-131, 27.12.2023
https://doi.org/10.21449/ijate.1321061

Abstract

The increasing volume of large-scale assessment data poses a challenge for testing organizations that must manage data and conduct psychometric analyses efficiently. Traditional psychometric software presents barriers, such as a lack of functionality for managing data and for running the full range of standard psychometric analyses efficiently, and these limitations raise the cost of achieving the desired research and analysis outcomes. To address these challenges, we designed and implemented a modernized data pipeline that allows psychometricians and statisticians to manage data efficiently, conduct psychometric analyses, generate technical reports, and perform quality assurance to validate the required outputs. This modernized pipeline has proven to scale to large databases, reduce human error by minimizing manual processes, make complex workloads repeatable, ensure high-quality outputs, and lower the overall cost of psychometric analysis of large-scale assessment data. This paper aims to provide information that supports the modernization of current psychometric analysis practices. We share details of the workflow design and functionality of our modernized data pipeline, which provides a universal interface to large-scale assessments, and discuss methods for developing non-technical, user-friendly interfaces.
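
The article does not reproduce source code. As a minimal sketch of what one stage of such a pipeline might look like, the following R example chains together the stages described above (data management, IRT calibration, quality assurance, and output generation) using packages cited in the reference list (data.table, mirt, openxlsx). The file names, column names, and flagging thresholds are hypothetical assumptions for illustration, not the authors' implementation.

```r
# Illustrative sketch of one pipeline stage; file names and thresholds are
# hypothetical placeholders, not the authors' production code.
library(data.table)  # fast reading of large response files
library(mirt)        # IRT calibration
library(openxlsx)    # writing QA workbooks

# 1. Data management: load scored item responses (one row per examinee).
responses <- fread("responses.csv")
item_cols <- setdiff(names(responses), "student_id")

# 2. Psychometric analysis: calibrate a unidimensional 2PL model.
model <- mirt(responses[, ..item_cols], model = 1, itemtype = "2PL", verbose = FALSE)
item_params <- coef(model, IRTpars = TRUE, simplify = TRUE)$items

# 3. Quality assurance: flag items with extreme parameter estimates.
qa_flags <- data.table(
  item = rownames(item_params),
  a    = item_params[, "a"],
  b    = item_params[, "b"],
  flag = item_params[, "a"] < 0.3 | abs(item_params[, "b"]) > 4
)

# 4. Output generation: write calibration results and QA flags to a workbook.
write.xlsx(list(parameters = as.data.frame(item_params), qa = qa_flags),
           file = "item_parameters.xlsx")
```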

References

  • Addey, C., & Sellar, S. (2018). Why do countries participate in PISA? Understanding the role of international large-scale assessments in global education policy. In A. Verger, H.K. Altinyelken, & M. Novelli (Eds.), Global education policy and international development: New agendas, issues and policies (3rd ed., pp. 97–117). Bloomsbury Publishing.
  • Allaire, J., Xie, Y., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., ... & Iannone, R. (2022). rmarkdown: Dynamic Documents for R. R package version 1.11.
  • Ansari, G.A., Parvez, M.T., & Al Khalifah, A. (2017). Cross-organizational information systems: A case for educational data mining. International Journal of Advanced Computer Science and Applications, 8(11), 170-175. https://doi.org/10.14569/IJACSA.2017.081122
  • Azab, A. (2017, April). Enabling docker containers for high-performance and many-task computing. In 2017 IEEE International Conference on Cloud Engineering (IC2E) (pp. 279-285). IEEE.
  • Bertolini, R., Finch, S.J., & Nehm, R.H. (2021). Enhancing data pipelines for forecasting student performance: Integrating feature selection with cross-validation. International Journal of Educational Technology in Higher Education, 18(1), 1-23. https://doi.org/10.1186/s41239-021-00279-6
  • Bertolini, R., Finch, S.J., & Nehm, R.H. (2022). Quantifying variability in predictions of student performance: Examining the impact of bootstrap resampling in data pipelines. Computers and Education: Artificial Intelligence, 3, 100067. https://doi.org/10.1016/j.caeai.2022.100067
  • Bezanson, J., Karpinski, S., Shah, V.B., & Edelman, A. (2012). Julia: A fast dynamic language for technical computing. arXiv preprint arXiv:1209.5145.
  • Bryant, W. (2019). Developing a strategy for using technology-enhanced items in large-scale standardized tests. Practical Assessment, Research, and Evaluation, 22(1), 1. https://doi.org/10.7275/70yb-dj34
  • Camara, W.J., & Harris, D.J. (2020). Impact of technology, digital devices, and test timing on score comparability. In M.J. Margolis & R.A. Feinberg (Eds.), Integrating timing considerations to improve testing practices (pp. 104-121). Routledge.
  • Chalmers, R.P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6), 1-29. https://doi.org/10.18637/jss.v048.i06
  • Croudace, T., Ploubidis, G., & Abbott, R. (2005). BILOG-MG, MULTILOG, PARSCALE and TESTFACT. British Journal of Mathematical & Statistical Psychology, 58(1), 193. https://doi.org/10.1348/000711005X37529
  • Desjardins, C.D., & Bulut, O. (2018). Handbook of educational measurement and psychometrics using R. CRC Press.
  • Dogaru, I., & Dogaru, R. (2015, May). Using Python and Julia for efficient implementation of natural computing and complexity related algorithms. In 2015 20th International Conference on Control Systems and Computer Science (pp. 599-604). IEEE.
  • Dowle, M., & Srinivasan, A. (2023). data.table: Extension of 'data.frame'. https://r-datatable.com, https://Rdatatable.gitlab.io/data.table
  • du Toit, M. (2003). IRT from SSI: BILOG-MG, MULTILOG, PARSCALE, TESTFACT. Scientific Software International.
  • Embretson, S.E., & Reise, S.P. (2000). Item response theory for psychologists. Erlbaum.
  • Goodman, D.P., & Hambleton, R.K. (2004). Student test score reports and interpretive guides: Review of current practices and suggestions for future research. Applied Measurement in Education, 17(2), 145-220. https://doi.org/10.1207/s15324818ame1702_3
  • Hambleton, R.K., Swaminathan, H., & Rogers, H.J. (1991). Fundamentals of item response theory (Vol. 2). Sage.
  • IBM Corp. (2020). IBM SPSS Statistics for Windows, Version 27.0 [Computer software]. IBM Corp.
  • Kamens, D.H., & McNeely, C.L. (2010). Globalization and the growth of international educational testing and national assessment. Comparative Education Review, 54(1), 5-25. https://doi.org/10.1086/648471
  • Liu, O.L., Brew, C., Blackmore, J., Gerard, L., Madhok, J., & Linn, M.C. (2014). Automated scoring of constructed-response science items: Prospects and obstacles. Educational Measurement: Issues and Practice, 33(2), 19-28. https://doi.org/10.1111/emip.12028
  • Lord, F.M., & Novick, M.R. (1968). Statistical theories of mental test scores. Addison-Wesley.
  • Lynch, S. (2022). Adapting paper-based tests for computer administration: Lessons learned from 30 years of mode effects studies in education. Practical Assessment, Research, and Evaluation, 27(1), 22.
  • Martinková, P., & Drabinová, A. (2018). ShinyItemAnalysis for teaching psychometrics and to enforce routine analysis of educational tests. R Journal, 10(2), 503-515.
  • Merkel, D. (2014). Docker: Lightweight linux containers for consistent development and deployment. Linux Journal, 2014(239), 2.
  • Microsoft Corporation. (2018). Microsoft Excel. Retrieved from https://office.microsoft.com/excel
  • Moncaleano, S., & Russell, M. (2018). A historical analysis of technological advances to educational testing: A drive for efficiency and the interplay with validity. Journal of Applied Testing Technology, 19(1), 1–19.
  • Morandat, F., Hill, B., Osvald, L., & Vitek, J. (2012). Evaluating the design of the R language: Objects and functions for data analysis. In ECOOP 2012–Object-Oriented Programming: 26th European Conference, Beijing, China, June 11-16, 2012. Proceedings 26 (pp. 104-131). Springer Berlin Heidelberg.
  • Muraki, E., & Bock, R.D. (2003). PARSCALE 4 for Windows: IRT based test scoring and item analysis for graded items and rating scales [Computer software]. Scientific Software International, Inc.
  • Oranje, A., & Kolstad, A. (2019). Research on psychometric modeling, analysis, and reporting of the National Assessment of Educational Progress. Journal of Educational and Behavioral Statistics, 44(6), 648-670. https://doi.org/10.3102/1076998619867105
  • R Core Team (2022). R: A language and environment for statistical computing (Version 4.2.1) [Computer software]. Retrieved from https://cran.r-project.org
  • Reise, S.P., Ainsworth, A.T., & Haviland, M.G. (2005). Item response theory: Fundamentals, applications, and promise in psychological research. Current Directions in Psychological Science, 14(2), 95-101.
  • Rupp, A.A. (2003). Item response modeling with BILOG-MG and MULTILOG for Windows. International Journal of Testing, 3(4), 365-384. https://doi.org/10.1207/S15327574IJT0304_5
  • Russell, M. (2016). A framework for examining the utility of technology-enhanced items. Journal of Applied Testing Technology, 17(1), 20-32.
  • Rutkowski, L., Gonzalez, E., Joncas, M., & Von Davier, M. (2010). International large-scale assessment data: Issues in secondary analysis and reporting. Educational Researcher, 39(2), 142-151. https://doi.org/10.3102/0013189X10363170
  • Scalise, K., & Gifford, B. (2006). Computer-based assessment in e-learning: A framework for constructing "intermediate constraint" questions and tasks for technology platforms. The Journal of Technology, Learning and Assessment, 4(6).
  • Schauberger, P., & Walker, A. (2022). openxlsx: Read, Write and Edit xlsx Files. https://ycphs.github.io/openxlsx/index.html, https://github.com/ycphs/openxlsx
  • Schleiss, J., Günther, K., & Stober, S. (2022). Protecting student data in ML Pipelines: An overview of privacy-preserving ML. In International Conference on Artificial Intelligence in Education (pp. 532-536). Springer, Cham.
  • Schloerke, B., & Allen, J. (2023). plumber: An API Generator for R. https://www.rplumber.io, https://github.com/rstudio/plumber
  • Schumacker, R. (2019). Psychometric packages in R. Measurement: Interdisciplinary Research and Perspectives, 17(2), 106-112. https://doi.org/10.1080/15366367.2018.1544434
  • Skiena, S.S. (2017). The data science design manual. Springer.
  • Sung, K.H., Noh, E.H., & Chon, K.H. (2017). Multivariate generalizability analysis of automated scoring for short answer items of social studies in large-scale assessment. Asia Pacific Education Review, 18, 425-437. https://doi.org/10.1007/s12564-017-9498-1
  • Thissen, D., Chen, W-H, & Bock, R.D. (2003). MULTILOG 7 for Windows: Multiple category item analysis and test scoring using item response theory [Computer software]. Scientific Software International, Inc.
  • Van Rossum, G., & Drake Jr, F.L. (1995). Python reference manual. Centrum voor Wiskunde en Informatica Amsterdam.
  • Volante, L., & Ben Jaafar, S. (2008). Educational assessment in Canada. Assessment in Education: Principles, Policy & Practice, 15(2), 201-210. https://doi.org/10.1080/09695940802164226
  • Weber, B. G. (2020). Data science in production: Building scalable model pipelines with Python. CreateSpace Independent Publishing.
  • Wickham, H. (2022). stringr: Simple, consistent wrappers for common string operations. https://stringr.tidyverse.org.
  • Wickham, H., François, R., Henry, L., & Müller, K. (2022). dplyr: A grammar of data manipulation. Retrieved from https://dplyr.tidyverse.org.
  • Wickham, H., & Girlich, M. (2022). tidyr: Tidy messy data. Retrieved from https://tidyr.tidyverse.org
  • Wise, S.L. (2018). Computer-based testing. In The SAGE Encyclopedia of Educational Research, Measurement, and Evaluation (pp. 341-344). SAGE Publications, Inc.
  • Ysseldyke, J., & Nelson, J.R. (2002). Reporting results of student performance on large-scale assessments. In G. Tindal & T.M. Haladyna (Eds.), Large-scale assessment programs for all students: Validity, technical adequacy, and implementation (pp. 467-483). Routledge.
  • Zenisky, A.L., & Sireci, S.G. (2002). Technological innovations in large-scale assessment. Applied Measurement in Education, 15(4), 337-362. https://doi.org/10.1207/S15324818AME1504_02


Details

Primary Language English
Subjects Measurement Theories and Applications in Education and Psychology
Journal Section Special Issue 2023
Authors

Ryan Schwarz 0009-0004-5867-3176

Hatice Cigdem Bulut 0000-0003-2585-3686

Charles Anifowose 0009-0006-2524-9613

Publication Date December 27, 2023
Submission Date June 30, 2023
Published in Issue Year 2023 Volume: 10 Issue: Special Issue

Cite

APA Schwarz, R., Bulut, H. C., & Anifowose, C. (2023). A data pipeline for e-large-scale assessments: Better automation, quality assurance, and efficiency. International Journal of Assessment Tools in Education, 10(Special Issue), 116-131. https://doi.org/10.21449/ijate.1321061
