

This paper compares different models for multilabel text classification, using information collected from Crunchbase, a large database that holds information about more than 600,000 companies. Each company is labeled with one or more categories, from a subset of 46 possible categories, and the proposed models predict the categories based solely on the company's textual description. A number of natural language processing strategies were tested for feature extraction, including stemming, lemmatization, and part-of-speech tags. The dataset is highly unbalanced: the frequency of each category ranges from 0.7% to 28%. Our findings reveal that the description text of each company contains features that allow its area of activity, expressed by its corresponding categories, to be predicted with about 70% precision and 42% recall. In a second set of experiments, framed as a multiclass problem that attempts to find the most probable category, we obtained about 67% accuracy using SVM and Fuzzy Fingerprints. The resulting models may constitute an important asset for the automatic classification of texts, including not only company descriptions but also other texts, such as web pages, blogs, and news pages.

We live in a digital society where data grows day by day, most of it consisting of unstructured textual data. This creates the need to process all this data in order to collect useful information from it. Text classification may be considered a relatively simple task, but it plays a fundamental role in a variety of systems that process textual data. E-mail spam detection is one of the best-known applications of text classification, where the main goal is to automatically assign one of two possible labels (spam or ham) to each message. Other well-known text classification tasks, which are receiving increasing attention, include sentiment analysis and emotion detection, which consist of assigning a positive or negative sentiment, or an emotion, to a text. Crunchbase is the largest company database in the world, containing a wide variety of up-to-date information about each company. Founded in 2007 by Michael Arrington, it originally served as the data store for its parent company, TechCrunch.
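To make the multilabel evaluation more concrete, the following minimal Python sketch shows one common way of computing precision and recall when each company may carry several categories. It assumes micro-averaging over all (company, category) pairs, which the text does not confirm as the averaging scheme used. The function name `micro_precision_recall` and the toy category names are hypothetical and serve only as illustration.

```python
from typing import List, Set, Tuple


def micro_precision_recall(
    gold: List[Set[str]], predicted: List[Set[str]]
) -> Tuple[float, float]:
    """Micro-averaged precision and recall for multilabel predictions.

    Each element of `gold` and `predicted` is the set of categories
    assigned to one company description.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    # A true positive is a category present in both the gold set and the
    # predicted set of the same company.
    true_pos = sum(len(g & p) for g, p in zip(gold, predicted))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)

    # Guard against division by zero when nothing is predicted or labeled.
    precision = true_pos / n_predicted if n_predicted else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical toy data: three companies, invented category names.
gold = [{"software", "mobile"}, {"health"}, {"finance", "software"}]
predicted = [{"software"}, {"health", "biotech"}, set()]

precision, recall = micro_precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy example the predictions are cautious: few categories are proposed, so most of those proposed are correct, but many gold categories are missed. This mirrors the reported pattern of precision being noticeably higher than recall, although the actual cause in the experiments may differ.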
