This project develops computational algorithms for classifying certain kinds of text. Human beings classify text naturally. For example, drawing on a priori knowledge, a person can easily tell whether a given text is "Old English" or "Modern English". Is this because everyone has mastered both of these languages? No; rather, it is because the human brain uses "trigger" words to identify or classify these types of text. For example, the words "thou" and "thee" are perhaps dead giveaways that a sentence containing them is not written in "modern" English. An alternative approach would be to create a dictionary of all possible words and then let the human brain search through the entire dictionary. While this may seem like a "perfect" (or perhaps flawless) solution, it is certainly not efficient: the brain has to go through every single case and choose either to accept or to reject it, which is quite tedious even for a human. Implemented on a computer, this would be similar to a "brute-force" algorithm in computer science, which is not something that anyone would wish to pursue. Therefore, one needs better algorithms in order to classify text efficiently.
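The trigger-word idea can be illustrated with a minimal Python sketch. It is not the project's actual method: the function name `has_trigger_word` is hypothetical, and the trigger list contains only the two words mentioned above.

```python
# Illustrative trigger list: only "thou" and "thee" come from the text above.
# A real list would be larger; this one is an assumption for demonstration.
TRIGGER_WORDS = {"thou", "thee"}


def has_trigger_word(sentence: str) -> bool:
    """Return True if any word in the sentence is a trigger word."""
    # Split on whitespace, strip surrounding punctuation, ignore case.
    words = (w.strip(".,;:!?\"'").lower() for w in sentence.split())
    return any(w in TRIGGER_WORDS for w in words)


if __name__ == "__main__":
    print(has_trigger_word("Thou art wise."))   # contains "thou"
    print(has_trigger_word("You are wise."))    # contains no trigger word
```

Checking a handful of trigger words takes time proportional to the length of the sentence, whereas the dictionary approach would require storing and consulting every possible word.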
The next question is the following: how do we want to subdivide the text into classes? For example, one could classify a document as either long or short based on a certain word length criterion. Since this sorts documents into two groups, it is called binary classification. This is the simplest possible classification scheme. There are other possible schemes, such as ternary (three groups, such as Yes, No, or Maybe), quaternary (four groups), and so on. In general, the more classes the user wishes to distinguish, the more complex the algorithm that has to be developed. This motivates the general problem that we consider.
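As a rough illustration of binary classification, the long/short example might be sketched as follows. The function name `classify_length` and the default threshold of 100 are assumptions made here, and the criterion is taken to be the number of words in the document, which is only one possible reading of "word length criterion".

```python
def classify_length(document: str, threshold: int = 100) -> str:
    """Assign a document to one of two classes, "long" or "short".

    The threshold is a hypothetical cutoff on the word count; it is not
    a value taken from the project.
    """
    n_words = len(document.split())
    return "long" if n_words >= threshold else "short"


if __name__ == "__main__":
    print(classify_length("a very short document"))
    print(classify_length("word " * 250))
```

A ternary scheme would need a second cutoff, which hints at how the complexity grows with the number of classes.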
Our overall interest is in classifying South Asian last names. The immediate question is: why South Asian last names? This problem has important medical implications, and it originated with a member of the medical community. The following "popular article" gives some motivation by describing how South Asians in particular are correlated with heart disease: South Asians and Heart Disease. Even a brief glance at this article shows that there is evidence of a correlation between being South Asian and heart disease. This motivates a statistical study of algorithms that properly identify South Asian last names. The reason is that when a doctor receives a large patient list and finds that a subset of that list developed heart disease, the doctor will not have to manually check the racial identity of those patients. Instead, the doctor could use a computational algorithm (with a certain level of predictability) to determine the race of the subset (here, how many in the subset are South Asian). This would automate the process of identifying the race of the patients who have this disease.
There are several notable issues in classifying names as South Asian versus non-South Asian. For one, South Asian names are sometimes very similar to non-South Asian names. For example, the last name "Ray" matches as a South Asian name, and it can also match as part (or all) of an American name. So which is it? The human brain can resolve this ambiguity rather easily. If the name is "Satyajit Ray", then it is clear to most people that this is a South Asian name. However, the name "Ray Bardham" is clearly NOT a South Asian name, but rather an American name. This one "trivial" example shows some of the complexity that we have to deal with in problems of this type. Another problem with South Asian names is that they tend to be quite long on average. Of course, this is not always true, as the aforementioned example of "Ray" as a South Asian last name shows. So our "tuplets" have to be adjusted accordingly.
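The text above does not define "tuplets". One possible reading, assumed here purely for illustration, is that they are short runs of consecutive characters (character n-grams) taken from a name. Under that assumption, a hypothetical helper `char_ngrams` shows why the tuple size has to adapt to short names such as "Ray":

```python
def char_ngrams(name: str, n: int = 3) -> list[str]:
    """Return the character n-grams of a name, with boundary markers.

    "^" and "$" mark the start and end of the name. This is an
    illustrative convention, not something specified by the project.
    """
    padded = f"^{name.lower()}$"
    if len(padded) <= n:
        # The name is too short to yield more than one n-gram of this size.
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("Ray", 3))
    print(char_ngrams("Ray", 5))
```

With `n = 3` the name "Ray" yields three n-grams, but with `n = 5` it collapses to a single one, so a tuple size chosen for long names carries little information for short ones.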
There has been some research work on compiling a source of South Asian names. For example, CEEHD (Centre for Evidence in Ethnicity, Health and Diversity) has compiled several programs that include listings of South Asian names: CEEHD work. However, CEEHD has pointed out that there are many areas in this field where improvement is possible. Thus, this problem is by no means solved, which provided some of the motivation to pursue this research work. There are plenty of research papers on text classification; see the reference list below or the related links for a more complete picture.
For details on our methodology please see the Final Presentation file below.
Final Presentation (7/24/03): PDF File
Related Links: