wiki:BluePrint/Messaging/ExtendingParsing

Version 16 (modified by Fran Boon, 10 years ago)

--

Blueprint for extending the message parsing framework

The inbound message parsing framework was developed during GSoC 2012. See the 2012 GSoC message parser project.

  • The framework is highly extensible, and the parsing workflows are customisable per deployment in the templates. A good example is the NLTK synonym-matching filter developed during the H4D2 hackathon (see here).
  • The system supports multiple communication channels, i.e. Email, SMS and Twitter. However, a number of other incoming feeds (not just SMS/Tweets, but also RSS feeds, etc.) can be integrated with the system, so plugging in RSS feeds would be one useful next step.
  • The things that we want to extract, and the essential requirements for the framework, are discussed below.
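The synonym-matching idea can be sketched as follows. This is an illustrative, hypothetical sketch only (not the actual H4D2 filter): the synonym table is hard-coded here, whereas a real deployment could populate it from NLTK's WordNet.

```python
# Hypothetical sketch of keyword matching with synonym expansion, in the
# spirit of the NLTK synonym-matching filter mentioned above.
# The SYNONYMS table is illustrative; a real deployment could build it
# from NLTK's WordNet synsets.

SYNONYMS = {
    "flood": {"flooding", "inundation", "deluge"},
    "fire": {"blaze", "wildfire"},
}

def expand_keywords(keywords):
    """Return the keywords plus any known synonyms."""
    expanded = set()
    for kw in keywords:
        expanded.add(kw)
        expanded.update(SYNONYMS.get(kw, ()))
    return expanded

def matches(message, keywords):
    """True if any (expanded) keyword appears in the message."""
    words = set(message.lower().split())
    return bool(expand_keywords(keywords) & words)
```

With this in place, a message mentioning "inundation" still matches a deployment's "flood" keyword without every synonym having to be configured by hand.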

Data Model Changes

  • Make msg_message a Super Entity
  • Each Channel would have an instance of this super entity, which acts as the InBox and/or OutBox as appropriate for that instance type
  • The 'Master Message Log' then becomes the view of the super-entity (rather than having to copy messages here)
  • Move non-core fields to component tables so that the core tables are uncluttered & fast
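The restructuring above can be sketched in plain Python (illustrative only, not actual S3/web2py code): each channel's inbox holds only channel-specific fields plus a reference into one shared message table, so the 'Master Message Log' is a view over that table rather than a copy.

```python
# Plain-Python illustration of the proposed super-entity pattern.
# `messages` plays the role of the msg_message super entity: core fields
# are stored here exactly once, and the Master Message Log is just a view.

messages = []   # the msg_message super entity: one row per message

def receive(channel, body):
    """Store the message once in the super entity; return its super-key."""
    message_id = len(messages)
    messages.append({"id": message_id, "channel": channel, "body": body})
    return message_id

# Instance tables (e.g. msg_email, msg_sms) keep only their super-key
# plus channel-specific fields, keeping the core table uncluttered.
email_inbox = [receive("email", "Volunteer address bounced")]
sms_inbox = [receive("sms", "FLOOD at Main St bridge")]

def master_log():
    """The Master Message Log becomes a view of the super entity."""
    return messages
```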

Input Source Improvements

Reliability/trustworthiness of the message sources/senders

  • Currently, this is done manually through the CRUD interface with the msg_sender data model.

  • A 'river' of messages is processed, with starring of senders and adding of keywords on the fly, so that the system gradually becomes more automated through the process.
  • We could also pre-populate the keyword database with the most frequently used keywords (especially in incident reporting), with the rest added on the fly.
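The workflow above could look roughly like this. Names and thresholds are illustrative, not the actual msg_sender data model: starring a sender raises their priority, keywords can be added on the fly, and both feed a simple triage step.

```python
# Hypothetical sketch of the 'river' triage workflow described above.

sender_priority = {}                          # sender -> number of stars
keywords = {"flood", "fire", "earthquake"}    # pre-populated incident keywords

def star_sender(sender):
    """Star a sender on the fly, raising their priority."""
    sender_priority[sender] = sender_priority.get(sender, 0) + 1

def add_keyword(word):
    """Add a keyword to the filters on the fly."""
    keywords.add(word.lower())

def triage(sender, message):
    """Route a message: 'urgent' for a starred sender or a keyword hit."""
    hit = any(kw in message.lower() for kw in keywords)
    if sender_priority.get(sender, 0) > 0 or hit:
        return "urgent"
    return "review"
```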

Parser Improvements

Topic Detection

  • KeyGraph is used to detect topics across tweets and other feeds, to separate relevant, actionable information from the rest. This is done after a loose filtering of the information based on keywords and location.
  • See http://keygraph.codeplex.com/ .
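The loose-filtering step that precedes topic detection can be sketched as below. The keyword and location lists are illustrative; the survivors would then be handed to a topic-detection algorithm such as KeyGraph.

```python
# Minimal sketch of loose pre-filtering by keyword and location,
# as described above, before topic detection (e.g. KeyGraph).
# KEYWORDS and LOCATIONS are illustrative placeholders.

KEYWORDS = {"flood", "collapse", "trapped"}
LOCATIONS = {"main st", "riverside"}

def loose_filter(feed):
    """Keep only items mentioning both a keyword and a known location."""
    kept = []
    for text in feed:
        low = text.lower()
        if any(k in low for k in KEYWORDS) and any(l in low for l in LOCATIONS):
            kept.append(text)
    return kept   # candidates for topic detection
```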

Actionability

  • Is this something that we can actually do something with?
  • It's important to manage the content coming from the various message sources and separate the messages that are actionable and contain useful information from the rest.

"Whom Should I Follow? Identifying Relevant Users During Crises":

Location

  • Another important requirement is to improve the ability to extract location data from unstructured text and make sense of ambiguous locations.
  • An OpenGeoSMS parser already exists in the default parser template (also available as an API within s3msg.py), which can parse the lat-lon information of a location from OpenGeoSMS-formatted messages. It would be even better if this were linked with the database (i.e. looking the location up from the database).
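A hedged sketch of the lat-lon extraction, similar in spirit to (but not copied from) the parser in s3msg.py. It assumes the common OpenGeoSMS form, a maps URL carrying `q=<lat>,<lon>`; looking the coordinates up against the location database would be the next step.

```python
import re

# Sketch of OpenGeoSMS lat-lon extraction. Assumes the usual form of an
# OpenGeoSMS message, which begins with a maps URL such as
#   http://maps.google.com/maps?q=25.033,121.565&GeoSMS
# This is an illustration, not the actual s3msg.py parser.

OPENGEOSMS = re.compile(r"maps\?q=(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?)")

def parse_opengeosms(text):
    """Return (lat, lon) as floats, or None if no OpenGeoSMS URL is found."""
    match = OPENGEOSMS.search(text)
    if not match:
        return None
    return float(match.group(1)), float(match.group(2))
```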

UI Improvements

http://tldrmpdemo.aidiq.com/eden/org/organisation/1/profile

  • S3Summary:

https://sahana.mybalsamiq.com/projects/sahanacommunityresiliencemappingprojectfinal/naked/Risk+Summary?key=ff49e93ddf8139e5eb61065660c796caa6f95845
http://i.imgur.com/jjaDmQ1.png
http://twitris.knoesis.org/indiarain2013/

(subscribe/unsubscribe)

  • Features
    • See all Messages in a datatable/list across media types (FB/Twitter/RSS/YouTube/Flickr)
    • Filter them
    • Add Sender to Whitelist/Blacklist
    • Add Keyword to back-end filters
    • View Images/Video
    • Find Situation Reports
      • ReliefWeb, etc.
    • Grouping/Linking results both to enhance validity & also provide a single point of entry
    • Route to other Sahana Modules
    • Drag and Drop between Raw source & Target Module
    • Mark for Action
      • create Tasks
      • create Incident Reports
      • create Assessments
      • create Situation Reports
    • Forward via Outbound Channels (Public e.g. Twitter & Private e.g. Email/SMS)
    • Semantic Search?
    • RDF Channel?

Use Cases

Parsing bounced messages

  • This is very important for IFRC Africa, who send out bulk emails to their volunteer base from Eden and want to know which addresses are mis-typed, which users have moved, etc.
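A minimal, stdlib-only sketch of what a bounce parser for this use case could do: extract the failed recipient from a standard delivery-status-notification (RFC 3464) bounce. Real bounces vary widely, so production code would need to handle more formats.

```python
from email import message_from_string

# Sketch: pull the Final-Recipient address out of a multipart/report
# bounce, so Eden could flag mis-typed or stale volunteer addresses.
# Illustrative only; bounce formats in the wild are far messier.

def bounced_recipient(raw_bounce):
    """Return the Final-Recipient address from a DSN bounce, or None."""
    msg = message_from_string(raw_bounce)
    for part in msg.walk():
        if part.get_content_type() == "message/delivery-status":
            # The delivery-status body parses into per-block sub-messages.
            for status in part.get_payload():
                final = status.get("Final-Recipient")
                if final:                 # e.g. "rfc822; user@example.org"
                    return final.split(";")[-1].strip()
    return None
```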
