Categorizing expenses with bank-learn

Once in a while, I feel the need to analyze my expenses to understand my own cost structure (and maybe identify overspending). My bank portal does a crude expense categorization per account over 3 months, while I prefer having a global view over an arbitrarily long period of time, like 1 year. Moreover, I prefer avoiding the use of a third-party (i.e. fintech) tool to analyze my expenses (private data). LibreOffice spreadsheets do a decent job at filtering and aggregating data visually; one can generate charts as well if necessary. My bank portal allows extracting account transactions in csv format, which can be imported directly into LibreOffice. However, these extracts lack (critical) categorization information (i.e. what category each expense belongs to).

bank-learn

This is where bank-learn comes into play. Based on the scikit-learn Python module, bank-learn is a tool that can enrich (and possibly aggregate) bank csv extracts with a transaction category, based on a training set built either manually or using the tool. A training set is nothing more than a chunk of csv extract, with an additional category column that has been filled in manually (arbitrary category names can be used). To ease the process of building the training set, bank-learn lets you interactively identify transactions placed in the wrong category and manually recategorize them to enrich the training set, and eventually obtain a better categorization.
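
For instance, a few lines of a training set could look like the following; the column layout here is hypothetical, it just has to match what the tool expects:

date;description;amount;category
12/03/2023;CARD 12/03 SUPERMARKET FOODSTORE PARIS;-42.17;groceries
15/03/2023;VIR.SEPA ACME CORP SALARY;2500.00;income
16/03/2023;PRLV ELECTRICITY CO REF 123456;-85.30;utilities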

Text classifier setup

bank-learn is essentially based on the scikit-learn Working With Text Data tutorial. At the core of the tool is a text classifier Pipeline, composed of:

  • a CountVectorizer, to extract numerical features that classifiers can use from text strings
  • a MultinomialNB classifier, which is a "Naive Bayes classifier for multinomial models"

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# STOP_WORDS is a custom list of words to ignore (see below).
self.__vectorizer = CountVectorizer(
    stop_words=STOP_WORDS,
    token_pattern=r'(?u)\b\w[a-zA-Z0-9_\-\.]+\b',
    ngram_range=(1, 3),
)
self.__text_clf = Pipeline([
    ('vect', self.__vectorizer),
    ('clf', MultinomialNB()),
])

The CountVectorizer uses a custom stop_words list, containing words that appear often in our bank transaction descriptions but aren't relevant for the classifier to make good decisions (they only introduce noise from the classifier's point of view). These words are ignored by the vectorizer when extracting the features used by the classifier. The stop_words list needs to be customized according to the country, and possibly to the bank as well (depending on the words it uses to describe its transactions). The tool may work with the default stop_words list (or without any), but would probably need a bigger training set than with a good stop_words list.
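
To give an idea, such a list could look like the following; the words below are made up for illustration, the real list has to be derived from your own bank's vocabulary. Note that CountVectorizer lowercases the text by default, so the stop words are given in lowercase:

# Hypothetical stop words: frequent but meaningless (for categorization
# purposes) words found in transaction descriptions.
STOP_WORDS = [
    'card', 'payment', 'transfer', 'debit',
    'ref', 'sepa', 'eur',
]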

The token_pattern has been customized as well, to capture '.' and '-' characters within words (bank transaction fields often use abbreviations containing '.' and/or '-' characters, and we don't want to split these abbreviations into one-character words). This is a Python re regular expression.
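
The difference is easy to observe with re directly; the transaction snippets below are made up:

import re

# Default scikit-learn token pattern: splits on '.' and '-'.
re.findall(r'(?u)\b\w\w+\b', 'VIR.SEPA M.DUPONT E-SHOP')
# ['VIR', 'SEPA', 'DUPONT', 'SHOP']

# Customized pattern: keeps abbreviations whole.
re.findall(r'(?u)\b\w[a-zA-Z0-9_\-\.]+\b', 'VIR.SEPA M.DUPONT E-SHOP')
# ['VIR.SEPA', 'M.DUPONT', 'E-SHOP']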

The ngram_range=(1,3) setting allows considering chunks of 1, 2 and 3 words, respectively called unigrams, bigrams and trigrams. This helps the classifier when successive words of a transaction description, taken together, have a meaning that isn't conveyed by the individual words; for instance, a bar named "the little bar" can be identified by the classifier if it considers trigrams, while the words "the", "little" and "bar" taken individually don't discriminate as efficiently as the trigram does.
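
This can be checked with the vectorizer's analyzer (default settings here, no stop words):

from sklearn.feature_extraction.text import CountVectorizer

analyzer = CountVectorizer(ngram_range=(1, 3)).build_analyzer()
print(analyzer('the little bar'))
# ['the', 'little', 'bar', 'the little', 'little bar', 'the little bar']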

The classifier itself is a multinomial Naive Bayes classifier, which is the first one proposed in the tutorial. It seems to work pretty well at classifying text strings, so I didn't feel the need to experiment with other classifiers (maybe some day).
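
Should the need arise, the same tutorial also proposes a linear support vector machine via SGDClassifier; experimenting would be a matter of changing one line of the Pipeline (untested sketch):

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

# Same vectorizer as above, different classifier.
self.__text_clf = Pipeline([
    ('vect', self.__vectorizer),
    ('clf', SGDClassifier(loss='hinge')),  # linear SVM
])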

Text classifier usage

Once the scikit-learn classifier has been built (cf. the previous section), its usage is straightforward:

# Train on the manually categorized training set...
self.__text_clf.fit(self.__training_set_x, self.__training_set_y)
# ...then predict a category for each uncategorized transaction.
self.__prediction = self.__text_clf.predict(self.__corpus)

The classifier's fit method trains it with a training set split into (uncategorized) transactions self.__training_set_x and the corresponding categories self.__training_set_y; both of them are lists of strings.
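
Concretely, the two lists could look like this (made-up data):

self.__training_set_x = [
    'CARD 12/03 SUPERMARKET FOODSTORE PARIS',
    'VIR.SEPA ACME CORP SALARY',
]
self.__training_set_y = [
    'groceries',
    'income',
]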

Once the classifier is trained, its predict method takes a list of (uncategorized) transactions (the transactions that we want to categorize, which don't necessarily appear in our training set) and generates the list of corresponding categories. It then suffices to format the data and generate the output csv file, with the transactions together with their categories.
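
For reference, here is what the whole flow could look like in a standalone sketch; the file names, the ';' separator and the 'description'/'category' columns are assumptions, an actual bank extract will differ:

import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

def read_rows(path):
    # Bank csv extract, assumed ';'-separated with a header line.
    with open(path, newline='') as f:
        return list(csv.DictReader(f, delimiter=';'))

# Train on the manually categorized training set.
training_rows = read_rows('training_set.csv')
text_clf = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 3))),
    ('clf', MultinomialNB()),
])
text_clf.fit([r['description'] for r in training_rows],
             [r['category'] for r in training_rows])

# Categorize a fresh extract and write the enriched csv.
rows = read_rows('extract.csv')
categories = text_clf.predict([r['description'] for r in rows])
with open('categorized.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['description', 'category'],
                            delimiter=';')
    writer.writeheader()
    for row, category in zip(rows, categories):
        writer.writerow({'description': row['description'],
                         'category': category})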

Results and thoughts

With a training set of 270 transactions "manually" categorized using the tool, it is able to automatically categorize my 1200 transactions in a fraction of a second, without any mistake (that I could see).

That said, I think that such a classification could have been achieved with a roughly equivalent set of basic string-matching rules (i.e. if the description contains this substring, put the transaction into that category). That's actually how I started the implementation of this tool; then I switched to scikit-learn, for the fun of discovering this machine learning kit and because writing string-matching rules was boring.
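
For comparison, that string-matching approach boils down to something like this (rules made up):

RULES = {
    'FOODSTORE': 'groceries',
    'ACME CORP': 'income',
}

def categorize(description):
    for substring, category in RULES.items():
        if substring in description:
            return category
    return 'unknown'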