Algorithm for converting text data to a feature vector
Answers
There are many methods for extracting features from text data in the literature (especially for Web and email categorization). Here are some first steps you can follow:
1. Trimming Vocabulary:
Remove “non-content” words: very frequent “stop words” such as “the”, “and”, ..., and very rare words (e.g., words that occur only 10 times in 100000 words).
2. Stemming:
Reduce all variants of a word to a single term (e.g., {see, saw, seen} -> “see”). Note that irregular forms such as “saw” need a dictionary-based lemmatizer; a plain suffix-stripping stemmer only handles cases like {connect, connected, connecting} -> “connect”.
3. Define Classes:
Decide how many classes you have. For example, you may have the classes Spam and Not Spam, or in news categorization Sport, Science, Politics, and so on. Based on your classes, assign each word to a class and give it a number (a score). Then count the frequency (number of occurrences) of each word and multiply it by its score, adding the results up per class.
Now you have a feature vector with roughly one entry per class, and you can classify it in a standard way or with an if-then-else procedure (similar to a decision tree) that you build yourself.
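The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a reference implementation: the stop-word list, the `STEM_MAP` lookup (standing in for a real stemmer or lemmatizer), and the word scores in `WORD_SCORES` are all made up for the example. Trimming rare words is left out because it needs word counts over a whole corpus.

```python
import re
from collections import Counter

# Assumed, tiny stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "a", "an", "of", "to", "in", "is"}

# Hypothetical lookup table standing in for a real stemmer/lemmatizer.
STEM_MAP = {"saw": "see", "seen": "see", "sees": "see", "won": "win", "wins": "win"}

# Hypothetical word -> (class, score) assignments (step 3).
WORD_SCORES = {
    "free": ("spam", 2.0),
    "win": ("spam", 1.5),
    "prize": ("spam", 2.0),
    "meeting": ("not_spam", 1.0),
    "see": ("not_spam", 0.5),
}
CLASSES = ["spam", "not_spam"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def normalize(tokens):
    """Step 1 (stop-word removal) followed by step 2 (stemming via lookup)."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [STEM_MAP.get(t, t) for t in kept]


def feature_vector(text):
    """Sum frequency * score per class; one vector entry per class."""
    counts = Counter(normalize(tokenize(text)))
    totals = dict.fromkeys(CLASSES, 0.0)
    for word, freq in counts.items():
        if word in WORD_SCORES:
            cls, score = WORD_SCORES[word]
            totals[cls] += freq * score
    return [totals[c] for c in CLASSES]


def classify(vector):
    """Simplest possible rule: pick the class with the largest entry."""
    best = max(range(len(vector)), key=vector.__getitem__)
    return CLASSES[best]
```

For example, `classify(feature_vector("Win a free prize"))` picks the spam class, because every scored word in that sentence was assigned to it. In practice you would learn the scores from labelled data rather than set them by hand.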