Text classification (a.k.a. text categorization) is one of the most prominent applications of machine learning. Its purpose is to give conceptual organization to large collections of documents. An interesting application of text classification is categorizing research papers by their most suitable conferences. Finding and selecting a suitable academic conference has always been a challenging task, especially for young researchers. We can define a ‘suitable academic conference’ as a conference that is aligned with the researcher’s work and has a good academic ranking. Usually, researchers have to consult their supervisors and search extensively to find one. Of the many conferences available, only a few are relevant venues for a given piece of research. To meet a conference’s editorial and content-specific requirements, researchers need to go through its previously published proceedings. Based on those proceedings, the research work is sometimes modified to increase its chances of acceptance and publication. This problem can be solved to some extent with machine learning techniques, e.g. classification algorithms such as SVM and Naïve Bayes.
Thus, the objective of this tutorial is to provide hands-on experience in performing text classification on a conference proceedings dataset. We will learn how to apply various classification algorithms to categorize research papers by conference, along with feature selection and dimensionality reduction methods, using the popular scikit-learn library in Python.
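As a rough illustration of the workflow described above, here is a minimal sketch that builds two scikit-learn pipelines: one combining TF-IDF features, chi-squared feature selection, and Naïve Bayes, and one combining TF-IDF features, dimensionality reduction with truncated SVD, and a linear SVM. The paper titles and the labels `DB_CONF` and `ML_CONF` are made up for illustration and are not the tutorial's dataset; the values `k=10` and `n_components=5` are arbitrary assumptions chosen to suit the tiny toy corpus.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: paper titles with made-up conference labels.
titles = [
    "Query optimization for distributed relational databases",
    "Indexing techniques for large scale transaction processing",
    "Efficient join algorithms in relational database systems",
    "Transaction management and query processing in databases",
    "Deep neural networks for image recognition",
    "Gradient descent methods for training neural networks",
    "Learning representations with deep convolutional networks",
    "Stochastic gradient optimization for deep learning",
]
labels = ["DB_CONF"] * 4 + ["ML_CONF"] * 4

# Pipeline 1: TF-IDF -> chi-squared feature selection -> Naive Bayes.
# chi2 requires non-negative features, which TF-IDF provides.
nb_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    SelectKBest(chi2, k=10),  # keep the 10 highest-scoring terms (assumed value)
    MultinomialNB(),
)

# Pipeline 2: TF-IDF -> truncated SVD (dimensionality reduction) -> linear SVM.
# SVD components can be negative, so an SVM is used here instead of Naive Bayes.
svm_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=5, random_state=0),  # assumed value
    LinearSVC(),
)

nb_clf.fit(titles, labels)
svm_clf.fit(titles, labels)

# Classify two unseen (also hypothetical) titles with each pipeline.
new_titles = [
    "Query processing in relational databases",
    "Training deep neural networks",
]
nb_pred = nb_clf.predict(new_titles)
svm_pred = svm_clf.predict(new_titles)
print("Naive Bayes:", list(nb_pred))
print("Linear SVM: ", list(svm_pred))
```

On a real proceedings dataset, one would typically hold out a test split and compare the pipelines with a metric such as accuracy before choosing between them.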