
Newton ch2

validity

Published in: Education

  1. THE EARLY YEARS OF VALIDITY, 1800s–1951 (Maryam Bolouri)
  2. Major developments in England, France, Germany, and the USA  1836: matriculation examinations  1845: first written examinations in the USA; the superiority of the written exam over the oral quiz  1853: India Act for impartial selection for the civil service  1858: local examinations at Oxford and Cambridge  Development of the statistical approach in Britain, e.g. Spearman's contributions
  3. Major developments in England, France, Germany, and the USA  1904: Binet in France developed a series of tests to discriminate unmotivated and incapable children from the others  USA: Yerkes et al. developed intelligence tests for army recruits  Purpose: bring scientific methods to the study of education, e.g. achievement tests and the development of mental tests  Problem: growing discontent over the unreliability of marks and unfair evaluation by human judges
  4. The personal-equation concern  Solution: sentence completion, T/F items, MC selection, etc.  Development of objective and standards-based assessment (first roots in the USA, so V is a product of North America)  Led to a mushrooming of published standardized tests and of research into tests and testing from 1910 to 1920
  5. The outcome of pre-1921  Structured and objective assessment  Distinction between sub-domains of educational and psychological measurement: 1. Professional communities: diagnosis, achievement, selection 2. Scientific communities: exploring personality characteristics and innate differences  Distinction between different types of tests (linguistic vs. performance, individual vs. group, written and standardized tests)  Recognition of the correlation coefficient as a tool for judging the quality of tests
  6. Post-1921 era  The term “V” began to take root in the lexicon of researchers and practitioners.  1911 Freeman: technique and V of test methods  1915 Terman: evaluated the V of intelligence and IQ tests  1916 Starch: referred to the V or fairness of measures  1916 Thorndike: essentials of a valid scale  1919: APA attempts at professional certification in response to the use of mental tests by unqualified individuals
  7. Post-1921 era  1921 NADER (National Association of Directors of Educational Research): sought standardization and consistency among concepts and procedures (similar to APA attempts in 1895 and 1906).  Regulations proposed by them: 1. Preparation and selection 2. Experimental organization of test and instructions 3. Trial of the tentative test 4. Final organization of the test 5. Final conduct of the test (scoring, tabulation, and interpretation) 6. Determine V 7. Determine R 8. Determine norms
  8. First official definition of V  By NADER  Challenged to promote and develop new methods  First classic definition of V: the degree to which a test or examination measures what it purports to measure.  The idea of a criterion was central to this, and the dominant approaches were predictive or concurrent ones. Content considerations existed but were not significant or robust.  1915–1930 boom period: new tests multiplied like rabbits, with little criticism of the instruments or the results
  9. Early years  Over-simplistic descriptions  Elaboration of insights that had been established before  Elevation of empirical evidence at the expense of logical analysis (dust-bowl empiricism)  According to Shepard, 1920–1950: deference to test-criterion correlations; 1940s: V = predictive correlation coefficient  According to Kane: the criterion phase  According to Cronbach: the whole of V theory was prediction
  10. Some issues regarding the early years: 1. We cannot ignore the early years  Theory of prediction; descriptive and explanatory investigations  Omitting discussion of the early years is counterproductive, and we shouldn't teach V from the baseline of 1954.  Only with reference to the baseline of 1921 can the transition from the Trinitarian conception of V to present-day theory be understood.
  11. Some issues regarding the early years: 2. Too many seminal works  In the early years there were too many seminal works, which made it impossible for a coherent tradition to emerge.  Each came with new perspectives  The 1920s were prolific for educational measurement  Differences in perspective among authors within and across sub-domains
  12. Some issues regarding the early years: 3. V in different ways and phases  Both world wars influenced testing and validation.  Large-scale implementation of mental testing, and a method of scoring by stencil for rapid marking, by Otis during the First World War  The Army Alpha and Beta gave mental testing publicity and prestige as measures of military aptitude  Mechanical test construction to predict criterion measures (blindly empirical)  This is only one side of a complex story stretching from the mid-19th to the 20th century (to 1952)
  13. The prediction phase, a caricature: 1) Widespread adoption of blindly empirical methods, specifically aptitude testing for the army 2) The degradation of the classic definition over time: the method for measuring V was mistaken for the definition of V. It consists of three stages: a) Quality of measurement b) Degree of correlation between test and criterion c) Correlation coefficient between the test and criterion
  14. From a to b. 1922 McCall: only by correlations do we know what a test measures  Classic definition: discrete V and validation  It was a conceptual abstraction.  A hypothetical true proficiency rank as an absolute criterion  There is no single true proficiency rank but a range of ranks  No sense of prediction, just correlation between actual test results and hypothetical proficiency
  15. From a to b. 1922 McCall: only by correlations do we know what a test measures  Two methods to determine the correspondence: 1. Prolonged careful observation in real-life situations: determine true proficiency and use it as the criterion, rank students on the test, and correlate the two rankings 2. Rank pupils of known proficiency, rank them on the test, and correlate the rankings
  16. Other approaches to developing a criterion:  Expert or teacher judgment  Results of multiple existing tests that measure the same thing  Results from specific tests
  17. From b to c: the change of criteria from conceptual abstraction to more concrete and pragmatic measures  Coefficient of V = correlation coefficient between test scores and criterion scores  V = observed agreement, rather than hypothetical agreement between test scores and true proficiency  V = empirical correlation  There was no question about the V of the criterion scores!  Fusion of definition and method  Underscored the use of the test: each test has a different V with regard to each use
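The "coefficient of validity" on this slide is just the Pearson correlation between test scores and criterion scores. A minimal sketch (the examinee scores below are hypothetical, invented for illustration):

```python
import math

def validity_coefficient(test_scores, criterion_scores):
    """Pearson correlation between test scores and criterion scores,
    i.e. the classic 'coefficient of validity'."""
    n = len(test_scores)
    mx = sum(test_scores) / n
    my = sum(criterion_scores) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(test_scores, criterion_scores))
    sx = math.sqrt(sum((x - mx) ** 2 for x in test_scores))
    sy = math.sqrt(sum((y - my) ** 2 for y in criterion_scores))
    return cov / (sx * sy)

# Hypothetical data: test scores and criterion ratings for five examinees
test = [52, 60, 71, 80, 88]
criterion = [50, 58, 69, 83, 85]
print(round(validity_coefficient(test, criterion), 3))  # prints 0.988
```

On this view, the single number printed above *is* the test's validity for that use, which is exactly the fusion of definition and method the slide criticizes.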
  18. From b to c: the change of criteria from conceptual abstraction to more concrete and pragmatic measures  Dominance of an atheoretical definition  Distinction between practical V and factorial V  Practical V: a test is valid for anything with which it correlates (Guilford, 1946)  There are two kinds of V, and practical V addresses the fundamental question of V  Undue emphasis on empirical evidence; problems: inadequacy of the definition, and the criterion problem
  19. Terman (1928): 3 primary concerns of educational and psychological measurement: 1. achievement 2. intelligence 3. aptitudes  1. School achievement (Walter Monroe): V as a multifaceted concept based on correlation; a conceptual definition of V was expressed: a. Objectivity in describing the performances (rater) b. Reliability (coefficient of R, index of R, error of measurement, coefficient of correspondence, overlapping of grade groups) c. Discrimination (agreement with the normal curve) d. Comparison with criterion measures e. V inference based on test structure and administration
  20. Terman (1928): 3 primary concerns of educational and psychological measurement: 1. achievement 2. intelligence 3. aptitudes  6 threats to valid interpretation: 1. Do the tasks require other abilities? 2. Can the tasks be answered by a variety of methods (other than the intended one)? 3. Is the test administered under a variety of conditions? 4. Do students continue to exercise their ability across all tasks? 5. Are the tasks representative of the field of ability being measured? 6. Are all students given this opportunity?
  21. Unitary conception of V:  Integration of multiple sources of empirical evidence and logical analysis  2 primary categories of sources of evidence: 1. Expert opinion vs. experimental (Ruch, 1929) 2. Curricular vs. statistical (Ruch, 1933)  3 approaches to logical analysis (Ruch, 1929): 1. A competent person's judgment on the appropriateness of content 2. Alignment of content with the textbook 3. Alignment of content with the recommendations of national education committees
  22. Terman (1928): 3 primary concerns of educational and psychological measurement: 1. achievement 2. intelligence 3. aptitudes  Fundamental role: extensive sampling in school achievement tests, either random sampling from the field or representation of its most important elements, measuring the same thing or attribute  Tests parallel to actual teaching  Centrality of logical analysis  Problem: no field is perfectly homogeneous, so there will always be a certain degree of compromise  Major innovation: scaling, tests with items at different levels of difficulty; items were then not selected to represent content effectively  Problem: tension between discrimination and sampling
  23. From random sampling to restricted sampling  It is not possible to construct a robust measure of overall achievement based on weighted sampling of behavior across the entire achievement domain. So instead of a representative sample, we should tap the essence of achievement: items with a high correlation to general achievement must be selected. Each item plays a role in contributing to the essence of the general-achievement attribute.  Items that discriminate between high and low students correlate highly with the criterion.
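The restricted-sampling logic above, keeping items that separate high scorers from low scorers, can be sketched with the classic upper-lower discrimination index. The data, threshold, and function names here are illustrative assumptions, not from the source:

```python
def discrimination_index(item_correct, totals, fraction=0.27):
    """Upper-lower discrimination index: proportion correct in the
    top-scoring group minus proportion correct in the bottom group."""
    n = len(totals)
    k = max(1, int(n * fraction))
    order = sorted(range(n), key=lambda i: totals[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower

def select_items(responses, totals, threshold=0.3):
    """Keep items whose discrimination index meets the threshold.
    responses[j][i] is 1 if examinee i answered item j correctly."""
    return [j for j, item in enumerate(responses)
            if discrimination_index(item, totals) >= threshold]

# Ten hypothetical examinees: item 0 tracks overall standing, item 1 does not
responses = [[0, 0, 0, 0, 0, 1, 1, 1, 1, 1], [1] * 10]
totals = list(range(10))
print(select_items(responses, totals))  # prints [0]
```

Note how the procedure discards the item everyone answers correctly even if it represents the curriculum, which is precisely the tension with representative sampling the next slide describes.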
  24. From random sampling to restricted sampling  V from the curriculum viewpoint and V from the general-achievement viewpoint need to arrive at a compromise.  A large unresolved tension can be detected throughout the study by Lindquist (1936)
  25. Terman (1928): 3 primary concerns of educational and psychological measurement: 1. achievement 2. intelligence 3. aptitudes  Tyler (1931): V in terms of the usefulness of the test in measuring the attainment of course objectives  He was not opposed to the empirical approach, but he was not impressed by the use of teacher marks as an empirical criterion  His suggestion: development of preliminary tests for each course objective to help 1) create comprehensive criterion measures and 2) serve diagnostic purposes; then preparation of practical tests to be validated by correlation
  26. Tyler’s concerns: 1. Sampling 2. Test construction 3. Validity 4. Mental processes: no distinction between the content of a subject and the required mental process; items test information, not the interpretation or application of principles 5. Negative impacts of tests on instruction and on the reform of the curriculum: studying and teaching were adapted to the emphasis of the tests
  27. Tension between the empirical and the logical, 1930s-1940s  Overemphasis on the empirical: inadequacy of criteria for establishing V, and backwash effects on teaching and learning  Overemphasis on the logical: impossibility of representative sampling, and fallibility of human judgement  Tyler: rational hypotheses in test construction  The pendulum swung against empirical considerations (the technician viewpoint)
  28. 2 key principles of the evaluation movement: 1. Evaluation could not begin until the curriculum had been defined in terms of behavioral objectives 2. Any useful device might be employed in producing an account of pupil growth:  Teacher judgment  Essay examination  Objective test
  29. Terman (1928): 3 primary concerns of educational and psychological measurement: 1. achievement 2. intelligence 3. aptitudes  Logical approach: raw brain power; the Binet-Simon scales were extended.  Problem: a thorough description of the universe of intelligent behavior was not straightforward; there was no clear definition  Binet: faculties are different from general intelligence; a single test can be a test of intelligence.  Post-Binet: not a single test but combined tests (manifold and heterogeneous); performance on a test is the product of both faculties and general intelligence.
  30. Terman (1928): 3 primary concerns of educational and psychological measurement: 1. achievement 2. intelligence 3. aptitudes  Solution: permissive sampling, assessing considerably more than the essence of intelligence  V can be maximized despite intentional construct under-representation or intentional construct-irrelevance  Assumption: random irrelevant item variance cancels out by the law of averages.
  31. Terman (1928): 3 primary concerns of educational and psychological measurement: 1. achievement 2. intelligence 3. aptitudes  Empirical approach:  A criterion measure of intelligence is needed  During the First World War, a number of reputed tests of higher quality were adopted as yardsticks: Otis Group Test (most valid); Terman Group; Miller Group Test (least valid); Army Alpha  Cattell (1943): promoted factor analysis as an important validation technique and transformed it from a lay activity into a scientific praxis
  32. Terman (1928): 3 primary concerns of educational and psychological measurement: 1. achievement 2. intelligence 3. aptitudes  For the purpose of vocational guidance and selection  1st assumption: aptitudes were stable, if not innate  2nd assumption: aptitudes differ across and within individuals along continua  The distinctive feature of aptitude measurement: the criterion was not something of the present but of the future.  Successful performance in a vocation = exercise of skills and abilities that had not yet been developed.  Problem: how should it be validated?
  33. Empirical approach to aptitude testing:  The idea of sampling is meaningless here, which led to the elevation of empirical approaches in 4 stages: 1) Administer the aptitude test 2) Wait until the required skills and abilities have been acquired 3) Assess job proficiency in situ 4) Correlate the test results with the assessment of job proficiency
  34. Empirical approach to aptitude testing:  Absence of clear rational principles  Development based on a haphazard trial-and-error search for effective predictors  With minimal rationality  Large lists of preferences to discriminate between professions  Selection of items with a high correlation to the criterion in successive fashion (the multiple-regression challenge): low inter-item correlation and high correlation with the criterion (a weakness of aptitude tests)
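The "successive" selection just described can be sketched as a greedy loop: repeatedly take the remaining item with the highest criterion correlation, skipping items that overlap too much with items already chosen. This is a simplified stand-in for the multiple-regression procedures actually used; the data and thresholds are hypothetical:

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def successive_selection(items, criterion, max_inter=0.5, min_valid=0.25):
    """Greedy sketch: add the item with the highest criterion correlation,
    skipping items that correlate too highly with any item already chosen
    (seeking low inter-item and high item-criterion correlation)."""
    ranked = sorted(range(len(items)),
                    key=lambda j: abs(pearson(items[j], criterion)),
                    reverse=True)
    chosen = []
    for j in ranked:
        if abs(pearson(items[j], criterion)) < min_valid:
            continue
        if all(abs(pearson(items[j], items[k])) <= max_inter for k in chosen):
            chosen.append(j)
    return chosen

# Hypothetical item scores for five examinees; items 0 and 1 are redundant
items = [[1, 2, 3, 4, 5], [1, 2, 3, 5, 4], [5, 1, 4, 2, 3]]
print(successive_selection(items, [1, 2, 3, 4, 5]))  # prints [0, 2]
```

Item 1 is dropped despite its strong criterion correlation because it duplicates item 0, while the weakly correlated item 2 survives, illustrating why such blindly empirical batteries could look arbitrary from any rational standpoint.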
  35. The Achilles' heel of aptitude testing  Robust criterion measures  V for criterion measures  2 major components of the criterion problem: 1. The definition of the criterion: subjective judgment and widespread lack of agreement over occupational success 2. The development of a procedure to measure the criterion
  36. Thorndike (1949): 3 categories of criteria: 1. Ultimate category: the complete final goal of a particular type of selection; multifaceted and not available for direct study 2. Intermediate category 3. Immediate category  Validation falls back on categories 2 and 3.  Blind empiricism is fragile and dangerous, as Messick repeatedly said from the 1970s to the 1990s.
  37. Mid-1940s: Paul Meehl and Lee Cronbach, construct V  Paul Meehl: dissatisfied with client self-rating  Self-rating should not be used as a behavior surrogate but as an indirect sign of something deeper, because it requires: 1. An appropriate level of self-understanding 2. Willingness to disclose
  38. Mid-1940s: Paul Meehl and Lee Cronbach, construct V  Lee Cronbach: the impact of item format  Response set: a characteristic tendency to respond to items in a particular way; 6 kinds of response set: giving many responses, speed, accuracy, gambling, etc.  A threat to V: different individuals demonstrate different response sets on the same items  Solution: use T/F items less and MC items more
  39. Cronbach (1949): 5 technical criteria of a good test: 1. Validity 2. Reliability 3. Objectivity 4. Norms 5. Good items  2 approaches: logical analysis (psychological understanding of the attribute) and empirical evidence
  40. Cronbach (1949): V as the correspondence of the test to the definition of the attribute  Some items correspond to the definition of the attribute yet bring in irrelevant variables that make them impure: 1. Items that different test takers answer using different methods 2. Items with limited accessibility to test takers from certain cultural groups 3. Items that are vulnerable to response sets 4. Items that correspond to content yet fail to assess the desired processes
  41. Cronbach (1949): ultimate considerations: 1. Logical analysis is inferior to empirical evidence. 2. Most frequently used criteria: instructor or supervisor ratings, and other tests of the same attribute 3. Discussed the criterion problem in depth 4. Rise of a particular empirical approach: factorial V, the degree to which a test purely measures one type of ability
