Blueprint for Outbound Messaging Enhancements
Outbound Queue Performance
We would like to send out new messages in a timely manner, without doing time-consuming sends in request context where they will slow down response to the user, but also handle retrying failed messages and avoid unnecessary overhead.
Right now, inserting a message in the queue (send_by_pe_id) is unconditionally followed by kicking off the outbound sending task (which runs asynchronously, not in request context). That run does not just attempt to send the single new message; it processes the entire queue. Messages that are in the queue in order to be retried due to an earlier send failure will be retried along with the one new message.
To avoid a time hit to users during requests that currently include a message send (currently the known cases are all registration email), we can instead queue up those messages. If we do that, we need to deal with the consequences of sharing the queue between new and retry messages.
Consider these cases:
- Email sending is not a common activity for the site -- sends occur less frequently than the periodic task runs. In that case, kicking off an immediate run of the entire outbound task is not a significant overhead, and if the periodic task itself runs infrequently, an immediate run makes complete sense.
- However, let's say that the site actively uses messaging -- that it's a common event.
- Say new sends are significantly more frequent than the periodic task. Now the async runs will dominate, and the periodic task and the site's chosen periodic rate will be irrelevant. The conditions that cause a message to fail and be retained in the queue for retry are unlikely to clear up in a short time, especially if the periodic time was chosen to be appropriate for retries. There will be a lot of overhead just from the task processing and futile retries.
- If the periodic task is scheduled frequently, there would, in fact, be little benefit from launching the queue processing on each send.
(Dominic points out that if each outbound send processes the entire queue including retries, then causing outbound sends could be used as a DoS attack. That works especially "well" if the sends are sure to fail and become retries.)
Other issues we want to deal with:
- Make sure queue runs do not overlap.
- Must make sure that inserts while the queue is being processed are (thread) safe.
Given that we want to queue up new messages rather than attempt a direct send before queuing (to avoid slow request response), and given that retries are a significant potential overhead, one option is to separate new message processing from retries. Sub-options:
- Use separate queues for new messages versus retries. After a run of the new message queue, each message will either be sent or will fail and be inserted in the retry queue, so -- except for messages inserted during the run -- the new message queue will be empty. This does not require a change to the queuing mechanism. Sub-sub-options:
- Launch the new message queue asynchronously when each new message is posted. Here, there would be a periodic task just for the retry queue.
- Defer new message sends to a periodic job that runs sufficiently frequently to not cause user annoyance. Here, there would be periodic tasks for both new and retry messages.
- Mark retries and add a parameter that specifies whether retries should be processed on a given run. This requires a change to the queuing mechanism, and although retries would not be attempted during a new message run, they would still have to be iterated over.
Since email will likely be a significant part of messaging, we might look at typical outbound email processing and retry handling. Email send failures fall into two categories:
- Permanent failures, for which no retry will succeed. These include "no such user". In traditional email processing (read, sendmail), these failures get bounced right away. It doesn't serve any purpose to keep processing permanent failures as retries. Some we might deal with automatically, but if not, we do want to keep them around for the site admin to deal with. We could move these to a holding queue.
- Temporary failures, for which a retry may succeed at some point. These include mailer misconfiguration or outage, no response from destination host, "user mailbox full"... Traditional email processing will eventually bounce even temporary failures. (If we want different retry rates for messages that have repeatedly failed, or for different urgency levels, we could use separate queues with separate periodic tasks. Messages could be demoted into less frequent queues based on the number of retry failures.)
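One possible way to tell the two categories apart is the SMTP reply code, where 5yz replies are permanent and 4yz replies are transient. The sketch below assumes that a reply code is available for the failure, and treats failures without one (for example, no response from the host) as temporary. The queue names and the thresholds `DEMOTE_AFTER` and `GIVE_UP_AFTER` are hypothetical, chosen only to illustrate demotion by retry count.

```python
PERMANENT = "permanent"
TEMPORARY = "temporary"

# Hypothetical thresholds: after DEMOTE_AFTER failed retries a message moves
# to a less frequently processed queue; after GIVE_UP_AFTER it is held.
DEMOTE_AFTER = 3
GIVE_UP_AFTER = 10


def classify_smtp_failure(reply_code):
    """Classify a failure: 5yz is permanent, 4yz or no reply is temporary."""
    if reply_code is None:
        return TEMPORARY  # e.g. connection failure, so no SMTP reply at all
    if 500 <= reply_code <= 599:
        return PERMANENT
    if 400 <= reply_code <= 499:
        return TEMPORARY
    raise ValueError("not an SMTP failure code: %r" % (reply_code,))


def next_queue(reply_code, failed_retries):
    """Pick the (hypothetical) queue a failed message should move to."""
    if classify_smtp_failure(reply_code) == PERMANENT:
        return "failures"
    if failed_retries >= GIVE_UP_AFTER:
        return "failures"
    if failed_retries >= DEMOTE_AFTER:
        return "retry_slow"
    return "retry"
```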
- Split outbound processing into three queues:
- New messages
- Retries (temporary failures awaiting another attempt)
- Failures (immediately permanent fails and max retries exceeded)
- Let site choose periodic task frequencies, and whether to start an async new message processing run when a new message is posted or use a periodic task for new messages.
- Include advice in the user guidelines for the above choices, based on the site's messaging behavior.
- Add logging of errors. Detect obvious misconfiguration and object strenuously.
- Add automated failure handling, and / or a way for the admin to review and deal with the failures.
- Attempt to detect transport issues that the site admin might need to deal with.
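The queue split proposed above might look roughly like the following. It is a minimal in-memory sketch, not the actual queuing mechanism: the `Outbound` class, the two exception types and the `max_retries` parameter are all hypothetical, and the `send` callable is assumed to signal the failure category by raising one of the two exceptions.

```python
from collections import deque


class PermanentFailure(Exception):
    """Hypothetical: raised by send() when no retry can succeed."""


class TemporaryFailure(Exception):
    """Hypothetical: raised by send() when a later retry may succeed."""


class Outbound:
    def __init__(self, send, max_retries=5):
        self.send = send
        self.max_retries = max_retries
        self.new = deque()
        self.retries = deque()    # entries are (message, retries_done)
        self.failures = deque()   # held for the site admin to review

    def post(self, message):
        self.new.append(message)

    def _needs_retry(self, message):
        """Try one send; return True only if a temporary failure occurred."""
        try:
            self.send(message)
        except PermanentFailure:
            self.failures.append(message)
        except TemporaryFailure:
            return True
        return False

    def run_new(self):
        # Only handle what was queued when the run started, so messages
        # posted during the run wait for the next one.
        for _ in range(len(self.new)):
            message = self.new.popleft()
            if self._needs_retry(message):
                self.retries.append((message, 0))

    def run_retries(self):
        for _ in range(len(self.retries)):
            message, retries_done = self.retries.popleft()
            if self._needs_retry(message):
                retries_done += 1
                if retries_done >= self.max_retries:
                    self.failures.append(message)
                else:
                    self.retries.append((message, retries_done))
```

With this shape, `run_new` can be launched asynchronously on each post or from its own periodic task, while `run_retries` stays on a separate, slower schedule.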
Email Error Handling
Web2py's Mail.send raises exceptions, contradicting its docstring. In particular, a common configuration error, failing to edit the example email sender, will cause a ticket when a user attempts to register. (Registration includes an email send even if no verification or approval is needed, as the user will get a welcome message in that case.) Since this is not a user error, but rather a misconfiguration that the site admin needs to deal with, it should not block registration nor show an error to the user. (See also ticket #439 -- we currently delete the user's temporary registration if there is an email error, but don't report anything to the site admin.)
We would prefer to have Web2py's Mail.send behavior fixed in Web2py, but this could be a backward-compatibility issue, and might take some time to resolve. So for now we'll subclass Mail (as MailS3) and show what we'd like done in Web2py. We can store our MailS3 instance in current.mail so it will be used by Web2py's Auth for those actions we don't override -- password change and username recovery.
Since we're subclassing Mail anyway, we can introduce other changes. In particular, we might want to use the outbound message queue rather than having the user wait for a direct send to be performed. This resolves the problem in ticket #439, where a configuration error breaks registration-related email. The user will not see any email errors, we can leave their temporary registration in place, and the actual send during queue processing will log an error for the site admin to deal with. When the configuration is repaired, the message should be retried and sent without further intervention.
These enhancements need to be in MailS3.send so that they're used by Web2py's Auth, but they're not what we want pushed as a fix to Web2py's send. We also still need access to the direct, non-queued send for outbound queue processing. So the send fix will be included in MailS3 as send_direct_email.
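The division of labour between send and send_direct_email could be sketched like this. To keep the example runnable on its own, `Mail` here is a small stand-in that always raises, imitating a misconfigured sender; it is not Web2py's class, and the real constructor and send() signature differ. The `enqueue` callable is likewise a hypothetical hook into the outbound queue, and returning False on failure is an assumption about the intended contract.

```python
import logging

logger = logging.getLogger("mail_s3_sketch")


class Mail:
    """Stand-in for Web2py's Mail so the sketch runs on its own.

    The real class and its send() signature differ; this one always raises,
    imitating a misconfigured sender.
    """

    def send(self, to, subject="", message=""):
        raise RuntimeError("sender not configured")


class MailS3(Mail):
    def __init__(self, enqueue):
        # enqueue is a hypothetical callable that adds a message to the
        # outbound queue.
        self.enqueue = enqueue

    def send_direct_email(self, to, subject="", message=""):
        """Direct send for queue processing: log errors, never raise."""
        try:
            return bool(Mail.send(self, to, subject, message))
        except Exception:
            logger.exception("Email send to %s failed", to)
            return False

    def send(self, to, subject="", message=""):
        """Queue the message instead of sending in request context."""
        self.enqueue((to, subject, message))
        return True
```

Callers such as Auth would only ever see the queuing send, while the outbound task would call send_direct_email and use its boolean result to decide whether the message stays queued.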
A new user does not have a pentity while going through email verification, and the approver is denoted only by an email address, so in order to queue up those messages we would like to extend outbound queuing for...