Math, asked by muneer8488, 1 year ago

Data matrix and interval-scaled variables in cluster analysis

Answers

Answered by Swetankan

TYPES OF DATA IN CLUSTER ANALYSIS

Data structures:

Data matrix (two modes): object-by-variable structure

Dissimilarity matrix (one mode): object-by-object structure

Object dissimilarity can be computed for interval-scaled variables, binary variables, nominal, ordinal, and ratio-scaled variables, and variables of mixed types.

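Interval-Scaled Variables

Continuous measurements on a roughly linear scale.

Standardize the data: for each variable f, compute the mean m_f and the mean absolute deviation s_f = (|x_1f - m_f| + ... + |x_nf - m_f|) / n, then the standardized measurement (z-score) z_if = (x_if - m_f) / s_f.

Using the mean absolute deviation is more robust to outliers than using the standard deviation, because the deviations from the mean are not squared.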
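The standardization step can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name standardize_mad and the sample heights are made up for the example.

```python
from statistics import mean

def standardize_mad(values):
    """Return z-scores that use the mean absolute deviation as the scale."""
    m_f = mean(values)                             # mean of the variable
    s_f = mean(abs(x - m_f) for x in values)       # mean absolute deviation
    if s_f == 0:
        raise ValueError("variable is constant; cannot standardize")
    return [(x - m_f) / s_f for x in values]

# Hypothetical heights in centimetres.
heights_cm = [150.0, 160.0, 170.0, 180.0]
print(standardize_mad(heights_cm))
```

After this step the values no longer depend on whether height was recorded in centimetres or inches.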

Similarity and Dissimilarity Between Objects

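Distances are normally used to measure the similarity or dissimilarity between two data objects.

A popular choice is the Minkowski distance: d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q), where q is a positive integer. q = 1 gives the Manhattan distance and q = 2 gives the Euclidean distance.

One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures.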
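As a rough sketch, the Minkowski distance between two objects with p interval-scaled variables could be computed like this. The function name minkowski and the sample points are assumptions for the example.

```python
def minkowski(x, y, q=2):
    """Minkowski distance of order q between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of variables")
    if q < 1:
        raise ValueError("q must be at least 1")
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

# q = 1 is the Manhattan distance, q = 2 the Euclidean distance.
print(minkowski([0, 0], [3, 4], q=1))
print(minkowski([0, 0], [3, 4], q=2))
```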

Binary Variables

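A contingency table for binary data counts, over all variables of two objects i and j: a = both are 1, b = i is 1 and j is 0, c = i is 0 and j is 1, d = both are 0.

Distance measure for symmetric binary variables: d(i, j) = (b + c) / (a + b + c + d)

Distance measure for asymmetric binary variables: d(i, j) = (b + c) / (a + b + c)

Jaccard coefficient (similarity measure for asymmetric binary variables): sim(i, j) = a / (a + b + c)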
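The three binary measures all come from the same four counts, so a small Python sketch may help. The names a, b, c, d follow the usual textbook contingency table (both 1, only the first is 1, only the second is 1, both 0); the function names and the sample vectors are assumptions.

```python
def contingency(x, y):
    """Count the four cells of the 2x2 table for two 0/1 vectors."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def symmetric_binary_distance(x, y):
    a, b, c, d = contingency(x, y)
    return (b + c) / (a + b + c + d)

def asymmetric_binary_distance(x, y):
    a, b, c, _ = contingency(x, y)   # 0/0 matches are ignored
    if a + b + c == 0:
        raise ValueError("no variable is 1 in either object")
    return (b + c) / (a + b + c)

def jaccard_similarity(x, y):
    return 1.0 - asymmetric_binary_distance(x, y)

x = [1, 0, 1, 0, 0, 0]
y = [1, 0, 0, 0, 1, 0]
print(symmetric_binary_distance(x, y), asymmetric_binary_distance(x, y))
```

The asymmetric version drops the 0/0 matches because, for variables such as a rare test result, two negatives say little about similarity.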

Categorical (Nominal) Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green.

Method 1: Simple matching, d(i, j) = (p - m) / p

m: number of matches, p: total number of variables

Method 2: Use a large number of binary variables, creating a new binary variable for each of the M nominal states.
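Both methods are easy to try out. The sketch below assumes the usual simple-matching dissimilarity d(i, j) = (p - m) / p; the function names and the sample objects are invented for the example.

```python
def simple_matching_distance(x, y):
    """Method 1: share of variables on which the two objects disagree."""
    p = len(x)                                   # total number of variables
    m = sum(1 for u, v in zip(x, y) if u == v)   # number of matches
    return (p - m) / p

def one_hot(value, states):
    """Method 2: one binary variable per nominal state."""
    return [1 if value == s else 0 for s in states]

print(simple_matching_distance(["red", "small", "round"],
                               ["red", "large", "round"]))
print(one_hot("blue", ["red", "yellow", "blue", "green"]))
```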

Ordinal Variables

An ordinal variable can be discrete or continuous.

Order is important, e.g., rank.

Ordinal variables can be treated like interval-scaled variables:

Replace each x_if by its rank r_if in {1, ..., M_f}.

Map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable with z_if = (r_if - 1) / (M_f - 1).

Compute the dissimilarity using the methods for interval-scaled variables.
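The rank-and-rescale steps above can be sketched as follows, assuming the usual mapping z = (r - 1) / (M - 1) for a variable with M ordered states. The function name and the medal example are hypothetical.

```python
def ordinal_to_unit_interval(values, ordered_states):
    """Replace each ordinal value by its rank, then map ranks onto [0, 1]."""
    rank = {state: r for r, state in enumerate(ordered_states, start=1)}
    m_states = len(ordered_states)
    if m_states < 2:
        raise ValueError("need at least two ordered states")
    return [(rank[v] - 1) / (m_states - 1) for v in values]

# States listed from lowest to highest.
print(ordinal_to_unit_interval(["bronze", "gold", "silver"],
                               ["bronze", "silver", "gold"]))
```

The resulting numbers can then go into any interval-scaled distance, such as the Euclidean distance.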

Ratio-Scaled Variables

A positive measurement on a nonlinear scale, approximately an exponential scale, such as Ae^(Bt) or Ae^(-Bt).

Methods:

Treat them like interval-scaled variables. This is not a good choice, because the scale can be distorted.

Apply a logarithmic transformation: y_if = log(x_if)

Treat them as continuous ordinal data and treat their ranks as interval-scaled.
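To see why the logarithmic transformation helps, here is a small sketch. The values A = 2 and B = 0.5 and the function name log_transform are assumptions chosen only for the illustration.

```python
import math

def log_transform(values):
    """Apply y = log(x) to a ratio-scaled (strictly positive) variable."""
    if any(v <= 0 for v in values):
        raise ValueError("ratio-scaled values must be positive")
    return [math.log(v) for v in values]

# Measurements that grow like A * e^(B * t) with A = 2, B = 0.5.
raw = [2.0 * math.exp(0.5 * t) for t in range(4)]
logged = log_transform(raw)

# Raw gaps keep growing; gaps after the transform are all about B.
print([round(b - a, 3) for a, b in zip(raw, raw[1:])])
print([round(b - a, 3) for a, b in zip(logged, logged[1:])])
```

After the transform the variable sits on a roughly linear scale, so the interval-scaled methods apply.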

Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval, and ratio.

One may use a weighted formula to combine their effects.
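One common form of such a weighted formula is d(i, j) = sum_f(delta_f * d_f) / sum_f(delta_f), where d_f is the per-variable dissimilarity in [0, 1] and delta_f is 0 when variable f should be skipped (a missing value, or a 0/0 pair on an asymmetric binary variable) and 1 otherwise. The sketch below assumes that form; the function name, the kind labels, and the sample objects are invented. Ordinal and ratio-scaled variables are assumed to have been converted already and are passed in as "interval".

```python
def mixed_dissimilarity(x, y, kinds, ranges):
    """Combine per-variable dissimilarities for objects of mixed types."""
    num = 0.0
    den = 0.0
    for f, kind in enumerate(kinds):
        xf, yf = x[f], y[f]
        if xf is None or yf is None:
            continue                      # delta = 0: missing value
        if kind == "asymmetric_binary" and xf == 0 and yf == 0:
            continue                      # delta = 0: uninformative 0/0 match
        if kind == "interval":
            d_f = abs(xf - yf) / ranges[f]   # scaled by the variable's range
        else:                             # nominal or binary
            d_f = 0.0 if xf == yf else 1.0
        num += d_f
        den += 1.0
    if den == 0:
        raise ValueError("no comparable variables")
    return num / den

kinds = ["interval", "nominal", "asymmetric_binary"]
ranges = [50.0, None, None]               # range (max - min) of variable 0
print(mixed_dissimilarity([30.0, "red", 0], [40.0, "blue", 0], kinds, ranges))
```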

Vector Objects

Vector objects: keywords in documents, gene features in micro-arrays, etc.

Broad applications: information retrieval, biological taxonomy, etc.

Cosine measure: s(x, y) = (x . y) / (||x|| ||y||), the cosine of the angle between the two vectors.
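A minimal sketch of the cosine measure, s(x, y) = (x . y) / (||x|| ||y||), is shown below. The function name and the sample keyword-count vectors are assumptions.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)

# Hypothetical keyword counts for two documents.
doc1 = [1, 2, 3]
doc2 = [2, 4, 6]   # same direction, twice the length
print(cosine_similarity(doc1, doc2))
```

Because only the direction matters, a long document and a short one with the same keyword proportions come out as very similar.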

Answered by 27swatikumari

Answer:

Data matrix:

A data matrix represents n objects, such as people, described by p variables (also called attributes or dimensions), such as age, height, weight, gender, and race. The structure is an n-by-p matrix or relational table (n objects x p variables).

Because its rows and columns represent different entities (objects and variables), the data matrix is often called a two-mode matrix.

Interval-scaled variable:

Interval-scaled variables are continuous measurements on a roughly linear scale.

Weight and height, latitude and longitude (for example, when clustering houses), and weather temperature are typical examples.

The chosen measurement unit can affect the clustering analysis. For instance, converting height from metres to inches or weight from kilograms to pounds may result in a very different clustering structure.
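A tiny made-up example shows the effect of units. The three people and the helper names below are hypothetical; the only fact used is that 1 metre is about 39.3701 inches.

```python
import math

def nearest(target, others):
    """Name of the object closest to target by Euclidean distance."""
    return min(others, key=lambda name: math.dist(target, others[name]))

def metres_to_inches(point):
    height_m, weight_kg = point
    return (height_m * 39.3701, weight_kg)

# (height in metres, weight in kilograms)
a = (1.60, 60.0)
others = {"B": (1.90, 61.0), "C": (1.62, 64.0)}

print(nearest(a, others))    # height in metres
others_in = {name: metres_to_inches(p) for name, p in others.items()}
print(nearest(metres_to_inches(a), others_in))   # height in inches
```

With height in metres the weight difference dominates and B is the nearest neighbour of A; with height in inches the height difference dominates and C becomes the nearest. This is why interval-scaled variables are usually standardized before clustering.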
