Blueprint for Outbound Messaging Enhancements
Outbound Queue Performance
Issues
We would like to send out new messages in a timely manner, without doing time-consuming sends in request context where they will slow down response to the user, but also handle retrying failed messages and avoid unnecessary overhead.
Right now, inserting a message in the queue (send_by_pe_id) is unconditionally followed by kicking off the outbound sending task (which runs asynchronously, not in request context). That isn't just making an attempt to send the single new message, but rather processes the entire queue. Messages that are in the queue in order to be retried due to an earlier send failure will be retried along with the one new message.
To avoid a time hit to users during requests that include a message send (the known cases are currently all registration email), we can instead, if msg is enabled, queue up those messages. If we do that, we need to deal with the consequences of sharing the queue between new and retry messages.
Consider these cases:
- Email sending is not a common activity for the site, so sends occur less frequently than the periodic task runs. In that case, kicking off an immediate run of the entire outbound task is not a significant overhead. If the periodic task is set to run infrequently compared to the frequency of new sends, then an immediate run makes complete sense, and as long as new sends are not frequent in absolute terms, it adds little overhead.
- However, let's say that the site actively uses messaging -- that it's a common event.
- Say new sends are significantly more frequent than the periodic task. Now the async runs will dominate, and the periodic task and the site's chosen periodic rate will be irrelevant. The conditions that cause a message to fail and be retained in the queue for retry are unlikely to clear up in a short time, especially if the periodic time was chosen to be appropriate for retries. There will be a lot of overhead just from the task processing and futile retries.
- If the periodic task is scheduled frequently, there would, in fact, be little benefit from launching the queue processing on each send.
(Dominic points out that if each outbound send processes the entire queue including retries, then causing outbound sends could be used as a DoS attack. That works especially "well" if the sends are sure to fail and become retries.)
Other issues we want to deal with:
- Make sure queue runs do not overlap.
- Must make sure that inserts while the queue is being processed are (thread) safe.
- Use logging rather than prints to stderr for error (etc.) messages intended for the site admin. (This is more general than messaging, but we could just start using it and get the pattern down before converting other prints to stderr.)
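The first and last items above can be sketched together: a guard that skips a queue run if one is already in progress, and errors reported through logging instead of stderr. This is a minimal sketch under stated assumptions, not Eden's implementation. The names run_outbound_queue and process_queue and the logger name are hypothetical, and a threading.Lock only protects runs within a single process; if queue runs are launched from separate worker processes, a database-level or file-based lock would be needed instead.

```python
import logging
import threading

logger = logging.getLogger("eden.msg.outbound")  # hypothetical logger name

# Guards a whole queue run, not individual inserts into the queue.
_run_lock = threading.Lock()


def run_outbound_queue(process_queue):
    """Call process_queue() unless another run is already in progress.

    Returns True if this call performed a run, False if it was skipped.
    """
    # Non-blocking acquire: a second caller gives up instead of waiting.
    if not _run_lock.acquire(blocking=False):
        logger.info("Outbound queue run already in progress; skipping")
        return False
    try:
        process_queue()
    except Exception:
        # Record the traceback for the site admin rather than printing it.
        logger.exception("Outbound queue run failed")
    finally:
        _run_lock.release()
    return True
```

A skipped run is harmless if a periodic task is also scheduled, since any message inserted during the active run is picked up by the next one.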
Options
Given that, if msg is enabled, we want to queue up new messages to avoid slow request response, and given that retries are a significant potential overhead, one option is to separate new message processing from retries. Sub-options:
- Use separate queues for new messages versus retries. After a run of the new message queue, each message will either be sent or will fail and be inserted in the retry queue, so -- except for messages inserted during the run -- the new message queue will be empty. This does not require a change to the queuing mechanism. Sub-sub-options:
- Launch the new message queue asynchronously when each new message is posted. Here, there would be a periodic task just for the retry queue.
- Defer new message sends to a periodic job that runs sufficiently frequently to not cause user annoyance. Here, there would be periodic tasks for both new and retry messages.
- Mark retries and add a parameter that specifies whether retries should be processed on a given run. This requires a change to the queuing mechanism, and although retries would not be attempted during a new message run, they would still have to be iterated over.
Since email will likely be a significant part of messaging, we might look at typical outbound email processing and retry handling. Email send failures fall into two categories:
- Permanent failures, for which no retry will succeed. These include "no such user". In traditional email processing (read, sendmail), these failures get bounced right away. It doesn't serve any purpose to keep processing permanent failures as retries. Some we might deal with automatically, but if not, we do want to keep them around for the site admin to deal with. We could move these to a holding queue.
- Temporary failures, for which a retry may succeed at some point. These include mailer misconfiguration or outage, no response from destination host, "user mailbox full"... Traditional email processing will eventually bounce even temporary failures. (If we want different retry rates for messages that have repeatedly failed, or, say, for different urgency levels, we could use separate queues with separate periodic tasks. Messages could be demoted into less frequent queues based on number of retry failures.)
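For email, the two categories map roughly onto SMTP reply codes, where 5yz replies are permanent and 4yz replies are transient. The sketch below shows one way the classification and the demotion idea could look. The function names, the tier parameters, and the override set are assumptions made for illustration. In particular, treating 552 (storage exceeded) as temporary is a choice made here to match the "user mailbox full" example above, since servers differ on which code they return for a full mailbox.

```python
PERMANENT = "permanent"
TEMPORARY = "temporary"

# Hypothetical site policy: 5yz codes we still want to retry.
TREAT_AS_TEMPORARY = {552}


def classify_smtp_failure(reply_code):
    """Classify an SMTP reply code as a permanent or temporary failure.

    A missing code (for example, no response from the host) is treated
    as temporary, which is an assumption of this sketch.
    """
    if reply_code is None or reply_code in TREAT_AS_TEMPORARY:
        return TEMPORARY
    if 500 <= reply_code <= 599:
        return PERMANENT
    return TEMPORARY


def retry_tier(failure_count, demote_every=3, slowest_tier=2):
    """Pick a retry queue: tier 0 is retried most often, higher tiers less.

    A message drops one tier after every `demote_every` failures and
    stays in `slowest_tier` once it gets there.
    """
    return min(failure_count // demote_every, slowest_tier)
```

Each tier would then have its own periodic task, with the slowest tier scheduled least often.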
Speaking of traditional email processing, one option is to hand off sends to a standard external mailer, like postfix. Advantages are:
- We wouldn't be reinventing the wheel.
- Postfix has excellent email handling.
- Dominic points out that it may be possible to use postfix for other transports besides email.
Disadvantages:
- If we can't use postfix for all the other transports, we'd still have to keep send queuing in Eden.
- We wouldn't get direct feedback from send attempts. Getting error info would allow warning users if their sends aren't going through within the context of msg, which is where they sent from. It would also allow alerting the site admin to config and operational problems. If postfix (or other external mailer) has an API, we might be able to get errors that way. Or we might scrape errors from its log. That might be more trouble than doing the queuing inside Eden.
- Postfix doesn't run under Windows (there appears to be broad consensus on this), so it would need to be run on Linux in a virtual machine. There are mailers for Windows, such as Exchange or hMailServer, but they'd need different configuration.
Proposal
Initial part:
- If msg is enabled, queue up new messages rather than sending them directly.
- Split outbound processing into three queues:
- New messages
- Retries
- Failures (immediately permanent fails and max retries exceeded)
- Let site choose periodic task frequencies, and whether to start an async new message processing run when a new message is posted or use a periodic task for new messages.
- Include advice in the user guidelines for the above choices, based on the site's messaging behavior.
- Add logging of errors. Detect obvious misconfiguration and object strenuously.
- Verify that there are no problems with (e.g.) orphaned entries if there are multiple accessors to the queues.
- Allow sending to bare addresses without pentities. For now, add two fields to the queue entry: address text and transport type. See below for discussion.
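The three-queue split can be sketched as below. This is an illustration under assumptions, not the actual queue code: in-memory deques stand in for whatever storage the outbound queue really uses, and QueuedMessage, the two exception classes, MAX_RETRIES, and process_queue are hypothetical names. The same function serves both the new message queue and the retry queue, so a site can drive each from an async launch or from a periodic task.

```python
from collections import deque
from dataclasses import dataclass

MAX_RETRIES = 5  # hypothetical site setting


@dataclass
class QueuedMessage:
    body: str
    address: str              # bare address text
    transport: str = "EMAIL"  # transport type
    failures: int = 0


class PermanentSendError(Exception):
    """No retry can succeed, e.g. "no such user"."""


class TemporarySendError(Exception):
    """A later retry might succeed, e.g. destination host not responding."""


def process_queue(source, retry_queue, failed_queue, send):
    """Attempt each message that is in `source` at the start of the run once."""
    # Fix the count up front so messages appended during the run
    # (including retries re-queued into the same deque) wait for the next run.
    for _ in range(len(source)):
        message = source.popleft()
        try:
            send(message)
        except PermanentSendError:
            failed_queue.append(message)
        except TemporarySendError:
            message.failures += 1
            if message.failures >= MAX_RETRIES:
                failed_queue.append(message)
            else:
                retry_queue.append(message)
```

A new message run would call process_queue(new, retries, failed, send), and a retry run would call process_queue(retries, retries, failed, send).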
Later:
- Add automated failure handling, and / or a way for the admin to review and deal with the failures.
- Attempt to detect per-transport issues that are causing send failures:
- If the issue affects all messages for that transport, skip other messages for that transport in the current processing run.
- If the issue is something that the site admin might need to deal with, let them know.
- Check whether we already prevent overlapping runs. (Note: the conditional scheduling I was working on, which turned out not to be needed for Ashwyn's GSoC project, did deal with this, which may be why I thought we might already be handling it.)
- Does this deal with the DoS?
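The per-transport item in the list above (skip a transport for the rest of the run once a failure is known to affect all of its messages) might look like the following. This is a sketch under assumptions: TransportDownError is a hypothetical exception that a send function would raise when it can tell the problem is transport-wide, and messages are reduced to (transport, payload) pairs.

```python
import logging

logger = logging.getLogger("eden.msg.outbound")  # hypothetical logger name


class TransportDownError(Exception):
    """Hypothetical signal that a failure affects a whole transport."""


def process_run(messages, send):
    """Attempt each (transport, payload) pair once.

    Returns (sent, deferred). After a transport raises TransportDownError,
    its remaining messages are deferred without a send attempt.
    """
    down = set()
    sent, deferred = [], []
    for transport, payload in messages:
        if transport in down:
            deferred.append((transport, payload))
            continue
        try:
            send(transport, payload)
        except TransportDownError as exc:
            down.add(transport)
            deferred.append((transport, payload))
            # One log line per transport per run, aimed at the site admin.
            logger.error("Transport %s unavailable: %s", transport, exc)
        else:
            sent.append((transport, payload))
    return sent, deferred
```

Skipping this way also limits the cost of futile retries within a single run, although it does not by itself settle the DoS question.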
Email Error Handling
Web2py's Mail.send raises exceptions, contradicting its docstring. In particular, a common configuration error, failing to edit the example email sender, will cause a ticket when a user attempts to register. (Registration includes an email send even if no verification or approval is needed, as the user will get a welcome message in that case.) Since this is not a user error, but rather a misconfiguration that the site admin needs to deal with, it should not block registration nor show an error to the user. (See also ticket #439 -- we currently delete the user's temporary registration if there is an email error, but don't report anything to the site admin.)
We would prefer to have Web2py's Mail.send behavior fixed in Web2py, but this could be a backward-compatibility issue, and might take some time to resolve. So for now we'll subclass Mail (as MailS3) and show what we'd like done in Web2py. We can store our MailS3 instance in current.mail so it will be used by Web2py's Auth for those actions we don't override -- password change and username recovery.
Since we're subclassing Mail anyway, we can introduce other changes. In particular, we might want to use the outbound message queue rather than having the user wait for a direct send to be performed. This solves the problem in ticket #439, where a configuration error interferes with registration-related email. The user will not see any email errors, we can leave their temporary registration in place, and the actual send during queue processing will log an error for the site admin to deal with. When the configuration is repaired, the message should be retried and sent without further intervention.
These enhancements need to be in MailS3.send so they're used by Web2py's Auth, but they're not what we want pushed as a fix to Web2py's send. We also still need access to the direct, non-queued send for outbound queue processing. So the send fix will be included in MailS3 as send_direct_email.
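The division of labor between send and send_direct_email could be sketched as follows. To keep the example self-contained, BaseMail is a stand-in written for this sketch and is not Web2py's Mail class. Its simplified send(to, subject, message) signature, the list used as the outbound queue, and the logger name are all assumptions.

```python
import logging

logger = logging.getLogger("eden.msg.mail")  # hypothetical logger name


class BaseMail:
    """Stand-in for Web2py's Mail: its send raises when sending fails."""

    def send(self, to, subject="", message=""):
        raise ConnectionError("mail server not configured")


class MailS3(BaseMail):
    def __init__(self, outbound_queue):
        # Any object with append() will do for this sketch.
        self.outbound_queue = outbound_queue

    def send_direct_email(self, to, subject="", message=""):
        """Direct, non-queued send: log failures and return False, never raise."""
        try:
            return bool(BaseMail.send(self, to, subject, message))
        except Exception:
            logger.exception("Email send to %s failed", to)
            return False

    def send(self, to, subject="", message=""):
        """Queue the message instead of sending in request context."""
        self.outbound_queue.append((to, subject, message))
        return True
```

With this split, code that calls send (such as Auth) returns immediately, and outbound queue processing calls send_direct_email for the actual attempt.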
A new user does not have a pentity while going through email verification, and the approver is denoted only by an email address, so in order to queue up those messages we would like to extend outbound queuing for...
Sending to addresses without pentities
Cases where we might want to send a message, but lack a pentity, include:
- Temporary users during registration processing.
- The approver, currently only denoted by an email address.
- Distribution list members, who may not be site users.
Options for supporting addresses that don't have associated pentities:
- Add a bare address field to the outbound message queue elements. Either support only email, or encode the transport into the address, e.g. via an IANA scheme (mailto, sms,...)
- Add a bare address field and a transport type field.
- Create a pr_contact record for the address -- the contact record would hold the address and transport type -- and add a reference to the contact in the queue element.
- Give the bare address a pentity and a pr_contact entry.
This last option avoids (to some extent) the need to modify queue processing. For registration, one might hand this pentity over to the temporary user on successful completion of registration...or just get rid of it when the user completes registration and receives their "real" pentity.
If we include an explicit transport type field, either in the queue entry or in a contact, we don't have to attempt to infer the transport from the form of the address, nor include a scheme.
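To make the first two options concrete, here is a sketch of a queue entry that carries either a pe_id or a bare address plus transport type, along with a helper for the scheme-encoded alternative. The class, the field names, the scheme mapping, and the transport labels are hypothetical and do not reflect the actual queue table.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from address scheme to transport type.
KNOWN_SCHEMES = {"mailto": "EMAIL", "sms": "SMS"}


@dataclass
class OutboxEntry:
    message_id: int
    pe_id: Optional[int] = None      # set when the recipient has a pentity
    address: Optional[str] = None    # bare address text otherwise
    transport: Optional[str] = None  # transport type for the bare address

    def __post_init__(self):
        # An entry needs a pentity, or else both address and transport.
        if self.pe_id is None and not (self.address and self.transport):
            raise ValueError("need pe_id, or both address and transport")


def split_scheme_address(value, default_transport="EMAIL"):
    """Split "mailto:user@example.com" into ("EMAIL", "user@example.com").

    A value without a known scheme is returned unchanged with the default
    transport, which corresponds to the "support only email" fallback.
    """
    scheme, sep, rest = value.partition(":")
    if sep and scheme.lower() in KNOWN_SCHEMES:
        return KNOWN_SCHEMES[scheme.lower()], rest
    return default_transport, value
```

With an explicit transport field, split_scheme_address would only be needed at input time, for example when reading an approver address from site configuration.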
The approver is awkward for assigning a pentity. We don't anticipate using anything but email for this. We shouldn't ask the site to create a user for their approver and then enter the pe_id in their config file. It's mainly awkward because we don't have web setup. Once we have web setup, we'll have to deal with saving site config info somewhere -- file or database or... -- so saving the pe_id of the approver won't be different from saving the key for any other selected configuration item that's a record in the database. There's an additional awkwardness in that the site will have to add the approver as a user before they can configure them as the approver, so they'll have to get that user in before they can start registering other users normally. Is this case sufficient to derail the option of giving bare addresses pentities and contact records?