Data matix , interval scaled variable in cluster analysis
Answers
TYPE OF DATA IN CLUSTERING ANALYSIS
Data structure Data matrix (two modes) object by variable Structure
Dissimilarity matrix (one mode) object –by-object structure
We describe how object dissimilarity can be computed for object by Interval-scaled variables,
Binary variables, Nominal, ordinal, and ratio variables, Variables of mixed types
Interval-Scaled variables (continuous measurement of a roughly linear scale) Standardize data
Using mean absolute deviation is more robust than using standard deviation
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include: Minkowski distance:
Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
Binary Variables
A contingency table for binary data
Distance measure for symmetric binary variables:
Distance measure for asymmetric binary variables:
Jaccard coefficient (similarity measure for asymmetric binary variables):
Categorical variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace xif by their rank
map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable
compute the dissimilarity using methods for interval-scaled variables
Ratio-scaled variable:
a positive measurement on a nonlinear scale, approximately at exponential scale, such as AeBt or
Ae-Bt
Methods:
treat them like interval-scaled variables—not a good choice! (why?—the scale can be distorted)
apply logarithmic transformation yif = log(xif)
treat them as continuous ordinal data treat their rank as interval-scaled
Variables of Mixed Types
A database may contain all the six types of variables symmetric binary, asymmetric binary,
nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects
Vector Objects
Vector objects: keywords in documents, gene features in micro-arrays, etc.
Broad applications: information retrieval, biologic taxonomy, etc.
Cosine measure
TYPE OF DATA IN CLUSTERING ANALYSIS
Data structure Data matrix (two modes) object by variable Structure
Dissimilarity matrix (one mode) object –by-object structure
We describe how object dissimilarity can be computed for object by Interval-scaled variables,
Binary variables, Nominal, ordinal, and ratio variables, Variables of mixed types
Interval-Scaled variables (continuous measurement of a roughly linear scale) Standardize data
Using mean absolute deviation is more robust than using standard deviation
Similarity and Dissimilarity Between Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects
Some popular ones include: Minkowski distance:
Also, one can use weighted distance, parametric Pearson product moment correlation, or other dissimilarity measures
Binary Variables
A contingency table for binary data
Distance measure for symmetric binary variables:
Distance measure for asymmetric binary variables:
Jaccard coefficient (similarity measure for asymmetric binary variables):
Categorical variables
A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
Method 1: Simple matching
m: # of matches, p: total # of variables
Method 2: use a large number of binary variables
creating a new binary variable for each of the M nominal states
Ordinal Variables
An ordinal variable can be discrete or continuous
Order is important, e.g., rank
Can be treated like interval-scaled
replace xif by their rank
map the range of each variable onto [0, 1] by replacing i-th object in the f-th variable
compute the dissimilarity using methods for interval-scaled variables
Ratio-scaled variable:
a positive measurement on a nonlinear scale, approximately at exponential scale, such as AeBt or
Ae-Bt
Methods:
treat them like interval-scaled variables—not a good choice! (why?—the scale can be distorted)
apply logarithmic transformation yif = log(xif)
treat them as continuous ordinal data treat their rank as interval-scaled
Variables of Mixed Types
A database may contain all the six types of variables symmetric binary, asymmetric binary,
nominal, ordinal, interval and ratio
One may use a weighted formula to combine their effects
Vector Objects
Vector objects: keywords in documents, gene features in micro-arrays, etc.
Broad applications: information retrieval, biologic taxonomy, etc.
Cosine measure
Answer:
Matrix data:
This represents p dimensions or attributes, such as age, height, weight, gender, race, and so on, for n objects, such as people. The structure is presented as an n-by-p matrix or relational table (n objects x p variables)
Since the rows and columns of the data matrix represent the various entities, it is frequently referred to
as a two-mode matrix.
Interval scaled variable:
Continuous measurements with a correspondingly linear scale are known as interval-scaled variables.
Weight and height, latitude and longitude (for example, when clustering houses), and weather temperature are typical examples.
The clustering analysis may be impacted by the chosen measurement unit. For instance, converting measurements for height from metres to inches or weight from kilos to pounds may result in a very different clustering structure.