Commit graph

8 commits

Author SHA1 Message Date
abp
c33591eaca instantiate and import spam classifier lazily
Co-authored-by: das <das@tutao.de>
2025-11-18 17:10:44 +01:00
map
5293be6a4a
Implement spam training data sync and add TutanotaModelV98
We sync the spam training data encrypted through our server to make
sure that all clients for a specific user behave the same when
classifying mails. Additionally, this enables the spam classification
in the webApp. We compress the training data vectors
(see clientSpamTrainingDatum) before uploading to our server using
SparseVectorCompressor.ts. When a user has ClientSpamClassification
enabled, the spam training data is synced for every mail
received.
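
As a rough illustration of why compressing sparse training vectors saves space, the sketch below stores only the non-zero entries of a vector together with its length. It is an assumption-laden sketch: the names compressSparse and decompressSparse and the index/value layout are hypothetical, and the actual format used by SparseVectorCompressor.ts is not described in this commit message.

```typescript
// Hypothetical sketch; NOT the actual SparseVectorCompressor.ts format.
type SparseEntry = { index: number; value: number }
type CompressedVector = { length: number; entries: SparseEntry[] }

// Keep only the non-zero positions of a mostly-zero vector.
function compressSparse(vector: number[]): CompressedVector {
  const entries: SparseEntry[] = []
  vector.forEach((value, index) => {
    if (value !== 0) entries.push({ index, value })
  })
  return { length: vector.length, entries }
}

// Rebuild the dense vector by writing the stored entries into a zero-filled array.
function decompressSparse(compressed: CompressedVector): number[] {
  const vector = new Array<number>(compressed.length).fill(0)
  for (const { index, value } of compressed.entries) {
    vector[index] = value
  }
  return vector
}
```

A round trip through both functions should return the original vector, which is the property a real compressor would also need to guarantee.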

ClientSpamTrainingDatum instances are not stored in the CacheStorage.
No entityEvents are emitted for this type.
However, we retrieve creations and updates for ClientSpamTrainingData
through the modifiedClientSpamTrainingDataIndex.

We calculate a threshold per classifier based on the dataset's
ham-to-spam ratio. We also subsample our training data to cap the
ham-to-spam ratio within a certain limit.
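
The subsampling step can be pictured with the following minimal sketch. Everything in it is an assumption for illustration: the Datum type, the function names, and the cap value are hypothetical, only an upper bound on the ratio is shown, and the commit does not state how the threshold itself is derived from the ratio, so that part is left out.

```typescript
// Illustrative only: names and the cap value are assumptions, not the real implementation.
type Datum = { isSpam: boolean; vector: number[] }

// Ham-to-spam ratio of a dataset; Infinity when there is no spam at all.
function hamToSpamRatio(data: Datum[]): number {
  const spamCount = data.filter((d) => d.isSpam).length
  const hamCount = data.length - spamCount
  return spamCount === 0 ? Infinity : hamCount / spamCount
}

// Drop surplus ham so that at most maxHamPerSpam ham samples remain per spam sample.
function capHamToSpamRatio(data: Datum[], maxHamPerSpam: number): Datum[] {
  const spam = data.filter((d) => d.isSpam)
  const ham = data.filter((d) => !d.isSpam)
  // Without spam samples the ratio is undefined, so leave the data untouched.
  if (spam.length === 0) return data
  const maxHam = Math.floor(spam.length * maxHamPerSpam)
  if (ham.length <= maxHam) return data
  // Deterministic truncation keeps the sketch testable; a real sampler would pick randomly.
  return [...spam, ...ham.slice(0, maxHam)]
}
```

With 2 spam and 10 ham samples and a cap of 3, the sketch keeps all spam and 6 ham samples.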

Co-authored-by: jomapp <17314077+jomapp@users.noreply.github.com>
Co-authored-by: das <das@tutao.de>
Co-authored-by: abp <abp@tutao.de>
Co-authored-by: Kinan <104761667+kibibytium@users.noreply.github.com>
Co-authored-by: sug <sug@tutao.de>
Co-authored-by: nif <nif@tutao.de>
Co-authored-by: map <mpfau@users.noreply.github.com>
2025-11-18 13:56:19 +01:00
das
f8bbd32695
Include header fields as tokens in the anti-spam
Add the header fields (sender, toRecipients, ccRecipients, bccRecipients,
authStatus) to the anti-spam vectors. We also improve some of the
preprocessing steps and add offline migrations by deleting old spam
tables.

Co-authored-by: amm <amm@tutao.de>
Co-authored-by: jhm <17314077+jomapp@users.noreply.github.com>
2025-11-18 10:37:23 +01:00
das
0739a78691
Fix retraining right after initial training.
- The field lastTrainedTime was not set during initial training, which
led to the spamClassifier retraining on the second login.
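
A minimal sketch of the kind of check involved, assuming a hypothetical TrainingState type and hypothetical helper names (only the field name lastTrainedTime comes from the commit message). If the initial training never records lastTrainedTime, the check keeps reporting that training is needed, which would match the retraining on the second login described here.

```typescript
// Hypothetical types and helpers; only the field name lastTrainedTime is from the commit.
type TrainingState = { lastTrainedTime: number | null }

// Decide whether a training run is needed, given the timestamp of the newest training datum.
function needsTraining(state: TrainingState, newestDatumTime: number | null): boolean {
  if (newestDatumTime === null) return false // no data at all, nothing to train on
  if (state.lastTrainedTime === null) return true // never trained before
  return newestDatumTime > state.lastTrainedTime // train only if there is new data
}

// Record the time of a completed training run, for initial and periodic runs alike.
function markTrained(state: TrainingState, now: number): TrainingState {
  return { ...state, lastTrainedTime: now }
}
```

Calling markTrained after the initial run as well as after every later run means the next login sees an up-to-date lastTrainedTime.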
2025-10-27 17:52:12 +01:00
abp
4e7c0f2fd5
do not try to train if there is no new data
Co-authored-by: map <mpfau@users.noreply.github.com>
2025-10-22 16:44:57 +02:00
abp
5124985d4f
remove DynamicTfVectorizer
Co-authored-by: map <mpfau@users.noreply.github.com>
2025-10-22 09:40:46 +02:00
sug
f11e59672e
improve inbox rule handling and run spam prediction after inbox rules
Instead of applying inbox rules based on the unread mail state in the
inbox folder, we introduce the new ProcessingState enum on
the mail type. If a mail has been processed by the leader client, which
is checking for matching inbox rules, the ProcessingState is
updated. If there is a matching rule, the flag is updated through the
MoveMailService; if there is no matching rule, the flag is updated
using the ClientClassifierResultService. Both requests are
throttled / debounced. After processing inbox rules, spam prediction
is conducted for mails that have not yet been moved by an inbox rule.
The ProcessingState for non-matching ham mails is also updated using
the ClientClassifierResultService.
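
The order of operations described above can be sketched as follows. This is not the actual client code: the Mail, InboxRule and Update types, the processMail function, the folder names and the moveReason values are all hypothetical, and routing predicted spam through the MoveMailService is an assumption, since the message only states explicitly what happens for rule matches and for non-matching ham mails.

```typescript
// Hypothetical sketch of the decision flow; all names below are assumptions.
type Mail = { sender: string; subject: string }
type InboxRule = { matches: (mail: Mail) => boolean; targetFolder: string }
type Update =
  | { service: "MoveMailService"; targetFolder: string; moveReason: "inboxRule" | "spamFilter" }
  | { service: "ClientClassifierResultService" }

// Returns which service would carry the ProcessingState update for this mail.
function processMail(mail: Mail, rules: InboxRule[], predictSpam: (mail: Mail) => boolean): Update {
  // Inbox rules are checked first.
  const rule = rules.find((r) => r.matches(mail))
  if (rule !== undefined) {
    return { service: "MoveMailService", targetFolder: rule.targetFolder, moveReason: "inboxRule" }
  }
  // Spam prediction only runs for mails that no inbox rule has moved.
  if (predictSpam(mail)) {
    // Assumption: predicted spam is moved, so the state travels with the move request.
    return { service: "MoveMailService", targetFolder: "spam", moveReason: "spamFilter" }
  }
  // Non-matching ham mail: nothing is moved, only the state is updated.
  return { service: "ClientClassifierResultService" }
}
```

The sketch leaves out the throttling / debouncing of both requests, which would sit around the calls to the two services.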

This new inbox rule handling solves the following two problems:
 - when clicking on a notification, it could still happen
   that the inbox rules were not applied
 - when the inbox folder had a lot of unread mails, the loading time
   increased massively, since inbox rules were re-applied on every load

Co-authored-by: amm <amm@tutao.de>
Co-authored-by: Nick <nif@tutao.de>
Co-authored-by: das <das@tutao.de>
Co-authored-by: abp <abp@tutao.de>
Co-authored-by: jhm <17314077+jomapp@users.noreply.github.com>
Co-authored-by: map <mpfau@users.noreply.github.com>
Co-authored-by: Kinan <104761667+kibibytium@users.noreply.github.com>
2025-10-22 09:40:45 +02:00
das
fd22294a18
[antispam] Add client-side local spam filtering
Implement a local machine learning model for client-side spam filtering.
The local model is implemented using TensorFlow "LayersModel" to train
separate models in all available mailboxes, resulting in one model
per ownerGroup (i.e. mailbox).

Initially, the training data is aggregated from the last 30 days of
received mails, and the data is stored in a separate offline database
table named spam_classification_training_data. The trained model is
stored in the table spam_classification_model. The initial training
starts after indexing, with periodic training happening
every 30 minutes and on each subsequent login.
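
To make the two time spans concrete, here is a small sketch of selecting the mails for the initial training and computing when the next periodic run is due. The constants mirror the 30 days and 30 minutes stated above; the ReceivedMail type and both function names are hypothetical and do not describe the client's actual scheduling code.

```typescript
// Hypothetical helpers; only the 30-day window and 30-minute interval come from the commit.
const DAY_MS = 24 * 60 * 60 * 1000
const TRAINING_WINDOW_DAYS = 30
const TRAINING_INTERVAL_MS = 30 * 60 * 1000

type ReceivedMail = { id: string; receivedAt: number } // receivedAt in epoch milliseconds

// Keep only mails received within the last 30 days relative to `now`.
function selectInitialTrainingMails(mails: ReceivedMail[], now: number): ReceivedMail[] {
  const cutoff = now - TRAINING_WINDOW_DAYS * DAY_MS
  return mails.filter((m) => m.receivedAt >= cutoff && m.receivedAt <= now)
}

// Earliest time at which the next periodic training run is due.
function nextTrainingTime(lastTrainedTime: number): number {
  return lastTrainedTime + TRAINING_INTERVAL_MS
}
```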

The model will predict on incoming mails once we have received the
entity event for said mail, moving it to either inbox or spam folder.
When users move mails, we update the training data labels accordingly,
by adjusting the isSpam classification and isSpamConfidence values in
the offline database. The MoveMailService now contains a moveReason,
which indicates that the mail has been moved by our spam filter.

Client-side spam filtering can be activated using the
SpamClientClassification feature flag, and is for now only
available on the desktop client.

Co-authored-by: sug <sug@tutao.de>
Co-authored-by: kib <104761667+kibibytium@users.noreply.github.com>
Co-authored-by: abp <abp@tutao.de>
Co-authored-by: map <mpfau@users.noreply.github.com>
Co-authored-by: jhm <17314077+jomapp@users.noreply.github.com>
Co-authored-by: frm <frm@tutao.de>
Co-authored-by: das <das@tutao.de>
Co-authored-by: nif <nif@tutao.de>
Co-authored-by: amm <amm@tutao.de>
2025-10-22 09:25:20 +02:00