Great Careers in Testing

Alina A. von Davier, PhD

Alina A. von Davier is a Strategic Advisor in the Research & Development Division at Educational Testing Service (ETS) in Princeton, NJ. She is also an Adjunct Professor at Fordham University. Her Ph.D. in mathematics was earned at the Otto von Guericke University of Magdeburg, Germany, and her M.S. in mathematics is from the University of Bucharest, Romania.

Dr. von Davier is responsible for fostering research relationships between ETS and the psychometric field, nationally and internationally, and for providing psychometric support to several international assessments. She has led an ETS Research Initiative called "Equating and Applied Psychometrics" for the past seven years. During her tenure at ETS, Dr. von Davier was the Research Center Director of the Global Psychometric Services (GPS) Center. The GPS center supports the psychometric work for all large ETS international programs, such as TOEFL iBT® and TOEIC®. She co-authored a book on the kernel method of test equating and guest co-edited for a special issue on population invariance of linking functions for Applied Psychological Measurement. She edited a volume on new models for test equating and authored a book on testing causal hypotheses. Currently she is co-editing a book on multi-stage testing. She has also published research articles in several leading psychometrics journals. Prior to working for ETS, she worked in Germany at the Universities of Trier, Magdeburg, Kiel, and Jena, and the ZUMA, a research institute, in Mannheim. She also worked for the Institute of Psychology of the Romanian Academy, and was a visiting scholar at the Stockholm School of Economics.

Your Career

At what point did you become interested in psychometrics and how did you decide to take it up as your career focus?
As a graduate student in the Mathematics Department, I worked for the Institute of Psychology of the Romanian Academy. The institute provided a very dynamic and intellectually stimulating environment, especially after the communist era. During that period of time, I tried to find my place among my psychologist colleagues. I realized that I could help them run their analyses and perhaps even create new ways to answer substantive research questions using mathematical modeling and statistics.

Given your experience as the ETS Research Center Director of the Global Psychometric Services Center and of the Psychometric, Development, and Resources, what are some words of wisdom you can share about the process of making sure that tests, such as the TOEFL iBT®, do not discriminate against test takers in a given year or across varying populations?
Educational Testing Service has numerous procedures and processes in place to ensure test fairness. It even has a collection of published standards, called "ETS Standards for Quality and Fairness," to which we adhere. There are two sets of approaches to fairness: procedures followed to ensure that there is measurement invariance across subpopulations of test takers and procedures followed to ensure that the meaning of the test results across test forms and administrations is preserved.

The first set of procedures is common to all tests. The work starts with the development of items and tests where the developers avoid using terminologies that are found to negatively impact the test takers or groups of test takers. Then the items and the tests are reviewed after the test administration. Various statistical procedures are used to identify items that may adversely impact groups of test takers. If problematic items are identified, a panel of content experts and psychometricians discusses whether the items should remain on the test. Research studies are conducted to investigate the measurement invariance of the test results with respect to groups of test takers. These analyses may range from descriptive statistics to factor analysis or multigroup modeling.

The second set of procedures is specific to standardized assessments used to make high-stakes decisions, where multiple test forms are constructed to be as similar as possible in terms of difficulty, content, specifications, etc. In standardized assessments, additional statistical procedures, such as equating, are necessary to adjust for any unintended differences in difficulty between test forms. Test equating should lead to scores that are interchangeable between the equated test forms. However, equating is a statistical procedure that is very sensitive to misspecifications and deviations from the planned data collection. Research studies are conducted regularly to investigate the population invariance of the equating results with respect to subgroups of test takers. Equating results need careful monitoring over time and adjustment if drifts are suspected.

Much of your work has been done on test equating -- could you please elaborate what test equating is and how it is both useful and important?
As I previously mentioned, test equating is the statistical procedure used to adjust for test form differences in a standardized assessment. Without test equating it would be impossible to maintain the meaning of test scores over time or to select candidates that take different test forms of the same assessment -- how would you know who to choose for your admissions decision, John or Jane, if both have the same total correct score, but took test forms that are not equally difficult? Perhaps John was lucky to receive an easier test form, while Jane took a more difficult test form. This is why test equating is so critical.
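
As a rough illustration of the adjustment described above, the sketch below applies linear equating, one of the simplest observed-score methods: a score on form X is mapped to the scale of form Y by matching the means and standard deviations of the two score distributions. It assumes the two forms were taken by randomly equivalent groups. The score lists and the function name `linear_equate` are hypothetical and chosen only for illustration; this is not the kernel method and not a description of any operational ETS procedure.

```python
from statistics import mean, pstdev


def linear_equate(x, scores_x, scores_y):
    """Map a score x on form X to the scale of form Y.

    Linear equating matches means and standard deviations:
        y = mu_Y + (sd_Y / sd_X) * (x - mu_X)
    Assumes both groups of test takers are randomly equivalent.
    """
    mu_x, mu_y = mean(scores_x), mean(scores_y)
    sd_x, sd_y = pstdev(scores_x), pstdev(scores_y)
    return mu_y + (sd_y / sd_x) * (x - mu_x)


# Hypothetical total-correct scores. Form X (Jane's form) is harder:
# comparable test takers score about 4 points lower on it than on form Y.
form_x_scores = [20, 24, 26, 28, 30, 32, 36]
form_y_scores = [24, 28, 30, 32, 34, 36, 40]

# John and Jane both have 30 items correct, but on different forms.
jane_raw_on_x = 30
john_raw_on_y = 30
jane_on_y_scale = linear_equate(jane_raw_on_x, form_x_scores, form_y_scores)
```

With these made-up numbers, Jane's 30 on the harder form corresponds to a higher score on form Y's scale than John's 30, which is the kind of difference that raw scores alone would hide.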

You co-authored a book called The Kernel Method of Test Equating (KE). What precisely is KE and how is it different from other equating methods?
KE is an observed-score equating method. It differs from other observed-score methods in two ways:

  1. It provides a unified approach to test equating; a framework to conceptualize and conduct the work. It includes all steps needed in the process of equating test forms, and accounts for the error introduced in the results at each step.
  2. KE uses continuous and differentiable functions as approximations to the discrete distributions of the test scores to compute the equating function. In contrast, the traditional methods use linear interpolations to approximate the discrete distributions, which result in continuous functions that are not differentiable. This difference becomes relevant in the equating results -- the KE results tend to be smoother, without gaps -- and in the analytical standard errors, which are defined mathematically and are quite accurate for the KE; the analytical standard errors are not defined mathematically for traditional equating methods.
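
The contrast in the second point can be sketched in a few lines of code. The example below compares two ways of turning a discrete score distribution into a continuous cumulative distribution function: a Gaussian-kernel continuization of the kind used in KE, and the piecewise-linear interpolation used by traditional methods. It is a minimal sketch, assuming integer scores and a fixed, hypothetical bandwidth `h` (KE selects the bandwidth by a separate criterion that is not shown here). The score distribution and function names are illustrative assumptions.

```python
import math


def moments(scores, probs):
    """Mean and variance of a discrete score distribution."""
    mu = sum(x * p for x, p in zip(scores, probs))
    var = sum(p * (x - mu) ** 2 for x, p in zip(scores, probs))
    return mu, var


def std_normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def kernel_cdf(x, scores, probs, h):
    """Gaussian-kernel continuization: smooth and differentiable everywhere.

    The factor a shrinks scores toward the mean so that the continuized
    distribution keeps the mean and variance of the discrete one.
    """
    mu, var = moments(scores, probs)
    a = math.sqrt(var / (var + h * h))
    return sum(
        p * std_normal_cdf((x - a * xj - (1.0 - a) * mu) / (a * h))
        for xj, p in zip(scores, probs)
    )


def interpolated_cdf(x, scores, probs):
    """Piecewise-linear CDF: each integer score is spread uniformly over
    [xj - 0.5, xj + 0.5]. Continuous, but with kinks at the half-points."""
    return sum(
        p * min(1.0, max(0.0, x - (xj - 0.5)))
        for xj, p in zip(scores, probs)
    )


# Hypothetical 5-point test with a symmetric score distribution.
scores = [0, 1, 2, 3, 4]
probs = [0.1, 0.2, 0.4, 0.2, 0.1]
h = 0.6  # hypothetical bandwidth, fixed here for illustration

smooth_at_median = kernel_cdf(2.0, scores, probs, h)
linear_at_median = interpolated_cdf(2.0, scores, probs)
```

Both functions agree at the center of this symmetric distribution, but only the kernel version has a well-defined derivative at every point, which is what makes analytical standard errors tractable.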

You act as Adjunct Professor at Fordham University and prior to your career at ETS had been working at various universities throughout Germany and Romania. How does working in a commercial setting differ from an academic setting? Do you have a preference for one or the other?
There are some differences between academia and industry even when, in this case, the industry is represented by a non-profit organization. First, in one setting you need to affect learning; in the other you need to consult or analyze operational data. In academia, you are in charge of your research agenda and can work on research projects with your students, and you essentially lead the effort. You often have little support to get things done -- budgets, editorial work. When you work on research projects in a research-oriented organization, you need to align your research agenda with the interests of the institution and often collaborate with other researchers who are experts in their own right. Working for an organization requires one to employ additional negotiation and communication skills. On the other hand, in both academia and in a research-oriented institution, it matters that you care about your research and that you follow an ethical conduct. I like that ETS has very high professional standards and am fortunate to find much inspiration for my research in the operational work. I also like and appreciate the fact that ETS provides me with the flexibility to teach and work with students, which is always refreshing.

What career accomplishments are you most proud of and where would you like to take your career in the future?
I am proud of the work I have done in test equating. I tried to bring a statistical approach to the field of equating, and to move the operational work and the teaching of equating from craftsmanship to science. I implemented the KE in several operational programs. I was told that my edited volume inspired the development of two R packages, which was very rewarding to hear. I've been called an expert in equating. This is, however, a bit scary. Shunryu Suzuki, the monk who helped popularize Zen Buddhism in the United States, wrote, "In the beginner's mind there are many possibilities. In the expert's mind there are few." Hence, I am trying to preserve a beginner's mind. I began conducting research in quality control and data mining as well as in test design, which includes work in multi-stage adaptive testing. I am more and more interested in assessing new constructs by taking advantage of the advances in technology and computer science. Of course, I am still interested and invested in equating. All of these new assessment contexts require that we think in terms of measurement, fairness, and comparability, and hence, in a less familiar way, about equating.

If you had only a 20-word sentence to convey what you know and do best, what would it be?
I can create and implement a long-term, strategic research agenda in support of an idea/program in the area of educational measurement.

Education and Careers in Psychometrics

For individuals considering a career in psychometrics, what are some of the key qualities, skills, or talents you believe should be possessed in order to be successful in the field?
The field of psychometrics is changing. The field of education is changing, too. Intelligent tutors are becoming more prominent in the classroom as well as in homework, while online courses are transforming the mode of teaching students. In the psychometric field, we need more and more interdisciplinary skills and motivation for life-long learning. I believe that it is still important to know calculus, matrix algebra, and computer programming. It is also imperative to understand mathematical and statistical models, and the estimation methods available. I see an interest in the field in using stochastic processes to model students' learning over time or to model assessment process data (e.g., interaction with the resources or peers, time information, or key-stroke logging data, etc.) and data mining techniques to discover patterns in large data sets.

At the International Meeting of the Psychometric Society Conference in 2011, you mentioned that an important skill for a student in psychometrics to acquire is programming. Could you please elaborate on this?
I believe it is important to know how to program in a programming language. When you write your own code, you understand the details of the procedures and the differences between the formulas in their analytical form and their numerical instantiation. It also gives you the freedom to write your own code to research your own models, or at the very least, it gives you the chance to understand someone else's code and its implications. It provides you with insight into communicating with a computer: the relationship between input and output, and how you need "to say" something so that it is "done" the way you want it. I think it gives one an appreciation for the elegance of computer programs, with the main program and multiple routines or functions called by that main program, a taste for efficient code, for the comments in the code (that you will come to appreciate later), the need to have a guide to your own program, and maybe an appreciation of user interfaces. Once you have this knowledge, you are to a large extent independent of the evolution of platforms, software development, etc. With minimal effort, you will know what to do if you need to write or decipher code in a language you haven't seen before.

In your opinion, is it an advantage to have a background in mathematics if working in the field of psychometrics?
I think that training in many disciplines, such as mathematics, statistics, computer science, quantitative psychology, quantitative educational measurement, or perhaps engineering could give you the tools you need or at least give you the confidence and knowledge to seek out and find the tools you need and the ability to learn how to use them.

How does expertise in test construction compare with the expertise of the subject matter in question when developing an assessment?
I believe it is very important to have a theory of what it is to be measured when you construct a test. In addition to subject matter expertise, you also need educational and psychological expertise. And, of course, psychometric expertise is a must-have. Psychologists often need to wear all these hats when they create psychological tests. In contrast, in educational tests, we assemble teams comprised of multiple experts.

The Past, Present and Future of Testing

What are some of the debates and conflicts that exist in the world of assessment? What types of organizations and interests are often at odds with one another?
I believe there is a gap between what we could do psychometrically and what we do in the existing assessments. There is also a gap between the way education is changing and the way we construct and administer assessments. Assessment practice needs to catch up with the way education has evolved, with essays being written (and scored) on computer; intelligent tutors being used as part of classroom instruction; online courses being seen as a substitute for traditional learning models; learning taking place via forums and chat rooms; collaborative problem solving being seen as a much-needed skill; and so forth. To move forward, the assessment community needs to partner with other types of institutions and industries, for example, the serious gaming industry.

In your estimation, what percentage of high-stakes tests are computerized adaptive tests? What percentage do you estimate it will be in 5-10 years?
I recommend the reader check the IACAT webpage. It is very informative. I expect that all tests will be adaptive in one way or another in the coming years. I think this transition will be hastened by the significant shift in the demographics of test takers, increasing the variance of the skills to be measured by tests.

What do you find are the advantages and disadvantages of computerized adaptive tests versus paper-and-pencil tests?
I believe that the computerized adaptive test design is the best design to handle a heterogeneous population of test takers. Since skills are learned and used in schools and in the workforce on the computer, it makes sense to use this medium to measure those skills.

Do you think an open-source online platform for creating computerized adaptive tests would benefit the psychometric community?
Yes, I believe that all software will be open source or free-ware in the next few years, including the ones for computerized adaptive tests.

What do you believe to be the next step when it comes to assessments such as the TOEFL iBT®? How have these tests evolved from the past to the present?
The TOEFL iBT® is a test of academic English. The test and its framework are regularly reviewed by content experts to ensure that they stay up-to-date in terms of the way the English language is used. The test went through a major revision in 2005. I believe that in the future, all tests will be used as a part of immersive virtual environments, composed of complex tasks that integrate multiple skills and require solutions in real time.

A computer generating unique items on the spot with prescribed difficulty has been found to work when it comes to Raven's Progressive Matrices. Do you think this could also work with educational standardized assessments such as the TOEFL iBT®? In your opinion, would this increase test validity? Could such software diminish the phenomenon of test takers revealing the items of the test to other future test takers?
While I do think that more items will be generated automatically, I don't think the same precision in difficulty would be achieved in educational assessments. We found that when we generate an item from an item family, we obtain a distribution of the difficulty for that item, with a mean and non-negligible variance. This means that while we can take advantage of automatic item generation (and scoring), we cannot yet ignore the role of equating, that is, adjusting for differences in difficulty between test forms.

What do you believe is the primary value of testing in our society and why do you think it is so important to have standardized tests, such as the TOEFL iBT®, in place today?
I mentioned above the need for the assessment to support education in our society. There is an obvious move in society towards non-traditional learning experiences, such as online courses, peer-instruction, and distance learning. Hence, one needs to have an instrument that is standardized and easy to use in order to select the best students from all these various learning experiences for a particular academic program. Other non-cognitive sources of information will continue to be used for selection. What else could be easy to request, is affordable to test takers, and can provide a fair score to everyone but an educational standardized test?

Do you find that standardized tests are all that is needed for evaluating an individual's skills and competencies?
A test score is a test score. A human being is more than a score. However, to ignore the ability of a good test to predict success in college, for example, would be unwise. I recommend standardized tests as tools, used in combination with a variety of other measures for admissions into an institution. When you need to choose among students from many countries with different educational and cultural systems -- all of which may be unfamiliar to an admissions officer -- a test score can provide a differentiator.
