Scaling is the measurement of a variable in such a way that it can be expressed on a continuum. Rating your preference for a product from 1 to 10 is an example of a scale.

With comparative scaling, the items are directly compared with each other (example: Do you prefer Pepsi or Coke?). In noncomparative scaling, each item is scaled independently of the others (example: How do you feel about Coke?).

Composite measures

Indexes are similar to scales, except that multiple indicators of a variable are combined into a single measure. The index of consumer confidence, for example, is a combination of several measures of consumer attitudes. A typology is similar to an index, except that the variable is measured at the nominal level. Scales, indexes, and typologies are all examples of composite measures.

Data types

The type of information collected can influence scale construction. Different types of information are measured in different ways. See in particular level of measurement.

Some data is measured at the nominal level. That is, any numbers used are mere labels: they express no mathematical properties. Examples are SKU inventory codes and UPC bar codes.
Some data is measured at the ordinal level. Numbers indicate the relative position of items, but not the magnitude of difference. An example is a preference ranking.
Some data is measured at the interval level. Numbers indicate the magnitude of difference between items, but there is no absolute zero point. Examples are attitude scales and opinion scales.
Some data is measured at the ratio level. Numbers indicate magnitude of difference and there is a fixed zero point. Ratios can be calculated. Examples include: age, income, price, costs, sales revenue, sales volume, and market share.
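
As a rough illustration of how the level of measurement constrains analysis, the sketch below returns only the summary statistics that are conventionally treated as meaningful at each level. The mapping (mode for nominal, median added for ordinal, mean added for interval and ratio) follows common practice and is an assumption, not something stated above; the function name `summarize` is hypothetical.

```python
from statistics import mean, median, mode


def summarize(values, level):
    """Return the summary statistics conventionally allowed at `level`.

    The level-to-statistic mapping is an assumption based on common practice.
    """
    if level == "nominal":
        # Labels only: counting the most frequent category is all we can do.
        return {"mode": mode(values)}
    if level == "ordinal":
        # Order is meaningful, so the median is too; distances are not.
        return {"mode": mode(values), "median": median(values)}
    if level in ("interval", "ratio"):
        # Differences are meaningful, so the arithmetic mean can be used.
        return {"mode": mode(values), "median": median(values), "mean": mean(values)}
    raise ValueError(f"unknown level of measurement: {level!r}")


if __name__ == "__main__":
    print(summarize(["A", "B", "A"], "nominal"))
    print(summarize([1, 2, 2, 3, 5], "ordinal"))
    print(summarize([10, 20, 20, 30], "ratio"))
```

Ratio-level data additionally supports statements such as "twice as much", which this sketch does not attempt to model.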

Scale construction decisions

What level of data is involved (nominal, ordinal, interval, or ratio)?
What will the results be used for?
Should you use a scale, index, or typology?
What types of statistical analysis would be useful?
Should you use a comparative scale or a noncomparative scale?
How many scale divisions or categories should be used (1 to 10; 1 to 7; -3 to +3)?
Should there be an odd or even number of divisions? (Odd gives neutral center value; even forces respondents to take a non-neutral position.)
What should the nature and descriptiveness of the scale labels be?
What should the physical form or layout of the scale be? (graphic, simple linear, vertical, horizontal)
Should a response be forced or be left optional?

Comparative scaling techniques

Paired comparison scaling - a respondent is presented with two items at a time and asked to select one (example: Do you prefer Pepsi or Coke?). This is an ordinal level technique when a measurement model is not applied. The Pairwise comparison model can be applied to derive measurements, provided the data derived from paired comparisons possess an appropriate structure. Thurstone's Law of comparative judgment can also be applied in such contexts.
Rasch scaling - respondents interact with items and comparisons are inferred between items from the responses. This involves application of the Rasch model to derive measurements. The Rasch model has an identical structure to the Pairwise Comparison model but contains a person parameter.
Rank-order scaling - a respondent is presented with several items simultaneously and asked to rank them (example: Rank the following advertisements from 1 to 10.). This is an ordinal level technique.
Constant sum scaling - a respondent is given a constant sum of money, scrip, credits, or points and asked to allocate these to various items (example: If you had 100 Yen to spend on food products, how much would you spend on product A, on product B, on product C, etc.). This is an ordinal level technique.
Bogardus social distance scaling - measures the degree to which a person is willing to associate with a class or type of people. It asks how willing the respondent is to make various associations. The results are reduced to a single score on a scale. There are also non-comparative versions of this scale.
Q-Sort scaling - Up to 140 items are sorted into groups based on a rank-order procedure.
Guttman scaling - This is a procedure to determine whether a set of items can be rank-ordered on a unidimensional scale. It utilizes the intensity structure among several indicators of a given variable. Statements are listed in order of importance. The rating is scaled by summing all responses until the first negative response in the list.
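
The Guttman scoring rule described above (sum responses until the first negative one) can be sketched in a few lines. This is a minimal illustration, assuming each response is recorded as a boolean and that the list is already in the order in which the statements are presented; the function name `guttman_score` is hypothetical.

```python
def guttman_score(responses):
    """Count positive responses up to, but not including, the first negative.

    `responses` is a sequence of booleans in the order the statements are
    listed (an assumed encoding; True means a positive response).
    """
    score = 0
    for agreed in responses:
        if not agreed:
            break  # responses after the first negative are ignored
        score += 1
    return score


if __name__ == "__main__":
    # Agrees with the first two statements, rejects the third.
    print(guttman_score([True, True, False, True]))
```

Under this rule the fourth response in the example does not contribute to the score, because it comes after the first negative response.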

Non-comparative scaling techniques

Continuous rating scale (also called the graphic rating scale) - respondents rate items by placing a mark on a line. The line is usually labeled at each end. There is sometimes a series of numbers, called scale points (say, from zero to 100), under the line. Scoring and codification are difficult.
Likert scaling - Respondents are asked to indicate the amount of agreement or disagreement (from strongly agree to strongly disagree) on a five-point scale. The same format is used for multiple questions.
Semantic differential scaling - Respondents are asked to rate an item on various attributes using a seven-point scale. Each attribute requires a scale with bipolar terminal labels.
Stapel scaling - This is a unipolar ten-point rating scale. It ranges from +5 to -5 and has no neutral zero point.
Thurstone scaling - This is a scaling technique that incorporates the intensity structure among indicators.
Mathematically derived scaling - Researchers infer respondents’ evaluations mathematically. Two examples are multidimensional scaling and conjoint analysis.
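
Because Likert items share one response format across multiple questions, their answers are often combined into a single summated score. The sketch below assumes answers are coded 1 (strongly disagree) to 5 (strongly agree) and that negatively worded items are reverse-keyed before summing; the coding, the reverse-keying step, and the name `likert_score` are assumptions made for illustration.

```python
def likert_score(answers, reverse_keyed=(), points=5):
    """Sum Likert answers, flipping the items listed in `reverse_keyed`.

    `answers` holds integers from 1 to `points`; `reverse_keyed` holds the
    zero-based positions of negatively worded items (assumed convention).
    """
    total = 0
    for position, answer in enumerate(answers):
        if not 1 <= answer <= points:
            raise ValueError(f"answer {answer} is outside 1..{points}")
        if position in reverse_keyed:
            answer = points + 1 - answer  # 1 <-> 5, 2 <-> 4, 3 stays 3
        total += answer
    return total


if __name__ == "__main__":
    # Third item is negatively worded, so its 2 counts as a 4.
    print(likert_score([5, 4, 2], reverse_keyed={2}))
```

Whether a summated Likert score may be treated as interval-level data is a judgment the researcher has to make; the code only performs the arithmetic.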

Scale evaluation

Scales should be tested for reliability, generalizability, and validity. Generalizability is the ability to make inferences from a sample to the population, given the scale you have selected. Reliability is the extent to which a scale will produce consistent results. Test-retest reliability checks how similar the results are if the research is repeated under similar circumstances. Alternative forms reliability checks how similar the results are if the research is repeated using different forms of the scale. Internal consistency reliability checks how well the individual measures included in the scale are converted into a composite measure.
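
One widely used statistic for internal consistency reliability is Cronbach's alpha. The text above does not name a specific statistic, so its use here is an assumption. The sketch computes alpha = k/(k-1) * (1 - sum of item variances / variance of total scores) for k items; the function name `cronbach_alpha` and the row-per-respondent data layout are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha for `rows`, one list of k item scores per respondent.

    Raises ZeroDivisionError if every respondent has the same total score.
    """
    k = len(rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across respondents.
    item_variances = [pvariance([row[j] for row in rows]) for j in range(k)]
    # Variance of the composite (summed) score across respondents.
    total_variance = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


if __name__ == "__main__":
    # Two items that move together perfectly give an alpha close to 1.
    print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))
```

Population variance is used for both the items and the total, so the choice between population and sample variance cancels out in the ratio.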

Scales and indexes have to be validated. Internal validation checks the relation between the individual measures included in the scale, and the composite scale itself. External validation checks the relation between the composite scale and other indicators of the variable, indicators not included in the scale. Content validation (also called face validity) checks how well the scale measures what it is supposed to measure. Criterion validation checks how meaningful the scale criteria are relative to other possible criteria. Construct validation checks what underlying construct is being measured. There are three variants of construct validity. They are convergent validity, discriminant validity, and nomological validity. The coefficient of reproducibility indicates how well the data from the individual measures included in the scale can be reconstructed from the composite scale.
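
The coefficient of reproducibility can be illustrated with a small sketch. It assumes one common convention for Guttman-type data: responses are coded 0 or 1, items are already in scale order, each respondent's composite score is the number of positive responses, and the pattern predicted from a score of s is s positives followed by negatives. Every mismatch between the actual and predicted pattern counts as an error, and the coefficient is 1 minus errors divided by the total number of responses. Other error-counting conventions exist, and the function name is hypothetical.

```python
def coefficient_of_reproducibility(patterns):
    """Share of responses correctly reconstructed from composite scores.

    `patterns` is a list of rows of 0/1 responses with items in scale order
    (assumed). Raises ZeroDivisionError if there are no responses at all.
    """
    errors = 0
    responses = 0
    for row in patterns:
        score = sum(row)  # composite score for this respondent
        predicted = [1] * score + [0] * (len(row) - score)
        errors += sum(actual != guess for actual, guess in zip(row, predicted))
        responses += len(row)
    return 1 - errors / responses


if __name__ == "__main__":
    # The first respondent's pattern deviates from a perfect scale.
    print(coefficient_of_reproducibility([[1, 0, 1], [1, 1, 0]]))
```

A value of 1 means every response pattern can be rebuilt exactly from the composite score; lower values mean more responses deviate from the ideal cumulative pattern.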