Text Categorization of South Asian Last Names

Sabyasachi Guharay
Advisor: Professor David Madigan

Abstract

This project implements statistical methods for classifying last names. In particular, we studied methods suited to identifying South Asian last names embedded in a collection of names of many other origins. We used an initial training data set of 10,000 names, of which 5,000 are South Asian and 5,000 are non-South Asian. We implemented algorithms such as Naive Bayes and Support Vector Machines to predict, with a well-defined confidence level, whether a name is South Asian. These algorithms have been tested on trial cases, and we obtained preliminary results on a randomly sampled test data set. Further analysis of these results is needed from the standpoint of error thresholding.

Further Details about the project

The basic idea of this project is to develop computational algorithms that classify certain kinds of text. Human beings classify text naturally. For example, from a priori knowledge a person can easily tell whether a passage is "Old English" or "Modern English". Is this because everyone has mastered both languages? No; rather, the human brain uses "trigger" words to identify or classify these types of text. For example, the words "thou" and "thee" are dead giveaways that a sentence containing them is not written in "modern" English. An alternative approach would be to create a dictionary of all possible words and let the human brain search through the entire dictionary. While this may seem like a "perfect" (or perhaps flawless) solution, it is certainly not efficient: the brain has to go through every single entry and either accept or reject it, which is tedious even for a human. Implemented on a computer, this would amount to a "brute-force" algorithm, which computer science tells us is rarely worth pursuing. Therefore, one needs better algorithms in order to classify text efficiently.
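The "trigger word" idea can be illustrated with a minimal Python sketch. The two trigger words come from the example above; the function name and the punctuation handling are assumptions made for illustration and are not part of the project's code.

```python
import string

# Trigger words taken from the example in the text; a real list would be larger.
TRIGGER_WORDS = {"thou", "thee"}

def looks_non_modern(sentence: str) -> bool:
    """Return True if the sentence contains at least one trigger word."""
    # Lowercase each token and strip surrounding punctuation before comparing.
    words = {token.strip(string.punctuation).lower() for token in sentence.split()}
    return not TRIGGER_WORDS.isdisjoint(words)

print(looks_non_modern("Wherefore art thou Romeo?"))
print(looks_non_modern("Where are you, Romeo?"))
```

Only a handful of set lookups are needed per sentence, in contrast to checking every word against a full dictionary.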

The next question is: how do we want to categorize the text? For example, one could classify a document as either long or short based on a word-count criterion. Since this assigns each document to one of two groups, it is called binary classification. This is the simplest possible classification scheme. There are other possible schemes, such as ternary (three groups, such as Yes, No, or Maybe), quaternary (four groups), and so on. In general, the more groups the user wishes to distinguish, the more complex the algorithm that has to be developed. This motivates the general problem we consider.
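As a small illustration of binary classification, the long-versus-short example might be sketched as follows. The function name and the default threshold of 100 words are hypothetical choices, not values from the project.

```python
def classify_by_length(text: str, max_short_words: int = 100) -> str:
    """Assign a document to one of two groups based on its word count.

    The threshold `max_short_words` is an arbitrary, illustrative cutoff.
    """
    word_count = len(text.split())
    return "short" if word_count <= max_short_words else "long"

print(classify_by_length("A very brief note."))
```

A ternary or quaternary scheme would need additional thresholds, which is one simple way to see why more groups generally mean a more complex rule.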

We are interested overall in classifying South Asian last names. The immediate question is: why South Asian last names? This problem has important medical implications, and it originated with a member of the medical community. The following "popular article" gives some motivation for how South Asians in particular are correlated with heart disease: South Asians and Heart Disease. A brief glance at this article shows that there is evidence of a correlation between South Asian origin and heart disease. This motivates a statistical study of algorithms that properly identify South Asian last names. Suppose a doctor receives a large patient list and finds that a subset of those patients developed heart disease. Rather than manually checking the racial identity of each of those patients, the doctor could use a computational algorithm (with a known level of accuracy) to determine how many patients in the subset are South Asian. This would automate the process of identifying the race of the patients who have the disease.

There are several notable difficulties in distinguishing South Asian from non-South Asian names. For one, some South Asian names closely resemble non-South Asian names. For example, "Ray" is a South Asian last name, but it can also be part (or the whole) of an American name. So which is it? The human brain can resolve this ambiguity rather easily. If the name is "Satyajit Ray", it is clear to most people that this is a South Asian name. The name "Ray Bardham", however, is clearly not a South Asian name but an American one. This one "trivial" example shows some of the complexity involved in problems of this type. Another issue is that South Asian names tend to be quite long on average. Of course this is not always true, as the example of "Ray" as a South Asian last name shows. So our "tuplets" have to be adjusted accordingly.
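To make the idea concrete, here is a rough Python sketch of a Naive Bayes classifier over character n-grams. It assumes that the "tuplets" are short character sequences taken from a name, which the text does not confirm. The class and function names, the label strings, and the tiny training lists are all hypothetical and stand in for the project's actual Perl program and 10,000-name data set.

```python
import math
from collections import Counter

def char_ngrams(name: str, n: int = 3) -> list[str]:
    """Split a name into overlapping character n-grams, with ^ and $ marking the ends."""
    padded = f"^{name.lower()}$"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

class NaiveBayesNameClassifier:
    """Multinomial Naive Bayes over character n-grams with add-alpha smoothing."""

    def __init__(self, n: int = 3, alpha: float = 1.0):
        self.n = n
        self.alpha = alpha
        self.gram_counts: dict[str, Counter] = {}  # label -> n-gram counts
        self.name_counts: Counter = Counter()      # label -> number of training names
        self.vocab: set[str] = set()

    def fit(self, names: list[str], labels: list[str]) -> "NaiveBayesNameClassifier":
        for name, label in zip(names, labels):
            grams = char_ngrams(name, self.n)
            self.gram_counts.setdefault(label, Counter()).update(grams)
            self.name_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, name: str) -> dict[str, float]:
        """Return log prior + summed log likelihood of the name's n-grams per label."""
        total_names = sum(self.name_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, counts in self.gram_counts.items():
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(self.name_counts[label] / total_names)
            for gram in char_ngrams(name, self.n):
                score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, name: str) -> str:
        scores = self.log_scores(name)
        return max(scores, key=scores.get)

# Toy training data for illustration only; not the project's name lists.
train_names = ["Mukherjee", "Banerjee", "Chatterjee", "Guharay",
               "Smith", "Johnson", "Williams", "Madigan"]
train_labels = ["south_asian"] * 4 + ["other"] * 4

clf = NaiveBayesNameClassifier(n=3).fit(train_names, train_labels)
print(clf.predict("Bhattacharjee"))
print(clf.predict("Smithson"))
```

The n-gram length `n` is the kind of parameter that would need tuning for short names such as "Ray", which yield very few n-grams and therefore little evidence either way.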

Some research has already been done on compiling a source of South Asian names. For example, CEEHD (Centre for Evidence in Ethnicity, Health and Diversity) has compiled several programs that include a listing of South Asian names: CEEHD work. However, CEEHD has pointed out that there are many areas in this field where improvement is possible. Thus, this problem is by no means solved, which provided some of the motivation for pursuing this research. There are plenty of research papers on text classification; see the reference list and the related links below for a more complete picture.

For details on our methodology, please see the Final Presentation file below.

Some Pertinent files

Text file for tuplets (keys): Text File
Text file for South Asian names: Text File
Text file for non-South Asian names: Text File
Perl program which computes key scores: Text File

Final Presentation (7/24/03): PDF File

References:

  • E. Bauer & R. Kohavi (1999). Machine Learning 36 (pp. 105-139).

  • L. Breiman (1996b). Machine Learning 24 (pp. 123-140).

  • W. Cohen & Y. Singer (1999). Proceedings of the Annual Conference of the American Association for AI (pp. 335-342).

  • K. Knight (1999). Commun. ACM 42 (pp. 58-61).

  • N. Fuhr & C. Buckley (1991). ACM Trans. Inform. Syst. 9 (pp. 223-248).

  • N. Fuhr & U. Pfeifer (1994). ACM Trans. Inform. Syst. 12 (pp. 92-115).

  • N.J. Belkin & W.B. Croft (1992). Commun. ACM 35 (pp. 29-38).

  • S. Weiss & I. Kapouleas (1989). Proceedings of the International Joint Conference on AI (pp. 781-787).

  • S. Weiss & N. Indurkhya (1993). IEEE Expert 8 (pp. 61-69).

  • S. Weiss & N. Indurkhya (1998). Predictive Data Mining: A Practical Guide. Morgan Kaufmann. DMSK Software: www.data-miner.com.

  • A. White (2002). Social Focus in Brief: Ethnicity 2002. Office for National Statistics, London.

Related Links:

Lightweight Induction Rule

IBM Software

Machine Learning Paper