Version 28 (modified by Fran Boon, 12 years ago)

GHC Social Media Project

Introduction

Receive tweets and / or SMS messages from the public.
Dispatch these to online workers to classify and geocode.
Display on a map.

Background

During the Haiti earthquake of Jan 2010, people trapped in buildings sent SMS messages to a designated shortcode. These were classified, translated, and geocoded by online workers using Amazon's Mechanical Turk, then provided to emergency managers.

During the Kenya 2013 general election, citizens and trained election monitors reported election-related incidents via SMS and Twitter. These were automatically entered into a map database, then vetted by online workers, who removed spam and contacted senders for clarification before the information was made public. See: https://uchaguzi.co.ke/

During a Random Hacks of Kindness hackathon in 2010, a variant of this project was implemented using Sahana Eden as the back end and a custom web page (not automatically generated by Eden) as the front end. This was designed as a training game -- workers got "experience points" and were awarded badges. See: http://gwob.org/101010-hackathon-winners/

Current System

For ease of getting a large dataset to play with, we will use Twitter for this example, although the same system can be used for other message channels, like Email, SMS & RSS.

Test Server to see User Interface:

http://demo.eden.sahanafoundation.org/eden/msg/

Install on your own system:

Install Eden
Install the TwitterSearch library
Follow the User Guidelines to get a Twitter OAuth account & use this to search

Relevant documentation:

Note: Documentation for both end-users & devs could use improvement; feel free to dive in!

Projects

There are 2 projects that we can work on during this session:

(A) Some simple enhancements to the current interface
(B) Creating a 'Human Intelligence Tasking' module to allow the processing of these messages to be divided up amongst a lot of workers

Simple Enhancements

The current workflow for a user to search Twitter is very cumbersome!

Some ideas for improvements (although feel free to come up with your own!):

There should be an option (on by default) to have the Search run after save
The next screen should be the Results, e.g. s3db.configure(tablename, create_next = URL(f="twitter_result")), where tablename is the name of the search table
There should be some filters above the results:
- http://eden.sahanafoundation.org/wiki/S3/FilterForms
There should be a link to see the results on a Map
- this will require modifying the code to have the msg_twitter_result table use the self.gis_location_id() foreign key instead of lat/lon fields
- The S3Map() method is then automatically accessible via /eden/msg/twitter_result/map
- This will use the same filters defined for the list view
There should be a Report method configured for charts based on pivot tables
- http://eden.sahanafoundation.org/wiki/S3/S3Report
We could create a 'Summary' view which allows the Table, Chart & Map to be separate tabs on a single page, sharing a common set of filters, keeping the settings when moving between the views
- Example: http://demo.lacrmt.sahanafoundation.org/eden/vulnerability/risk/summary
- This is accessible via /eden/msg/twitter_result/summary
- The Chart, Table, Filter & Map settings are inherited
- The configuration of which Tabs to display can be seen in the CRMT template
  - https://github.com/flavour/eden/blob/master/private/templates/CRMT/config.py#L235
We want to add tools to the Table view to Geocode, Classify, Whitelist sender, Blacklist sender

HIT Module

Project breakdown

This project is intended to be easy to subdivide into tasks that can be worked on somewhat independently and in parallel, given the choice of a few naming conventions for new database tables and fields.

In order to keep our work together, and distinct from other work, we'll add a new module. This is the first step in adding "human intelligence task" processing, in which results are verified by sending the same task to multiple workers and comparing the results. So let's call our new module "hit". That means the controller file will be:

eden/controllers/hit.py

The model will be:

eden/modules/s3db/hit.py

The view pages will be in the directory:

eden/views/hit

Everyone may find it useful to refer to:

The lesson on "making a new module" in the Eden book:
http://booki.flossmanuals.net/sahana-eden/_draft/_v/1.0/building-a-new-application/
The index of Eden APIs:
http://eden.sahanafoundation.org/wiki/S3

We'll deal with some data outside the new module, such as:

Received messages are stored in the "message log" table, msg_message.
https://github.com/flavour/eden/blob/master/modules/s3db/msg.py#L93

(msg_message is a special kind of table that Eden calls a "superentity", which is like a superclass but for database tables. Records in multiple specialized tables have "parent" records in a shared superentity table. Other tables can then refer to any of the specialized tables by linking to the superentity record, without needing a foreign key field for every one. References to ordinary non-superentity tables are simpler.)

Workers will sign up for accounts on an Eden site. When they sign up, they will have a record in the auth_user table and a record in the pr_person table for profile information. Because there are other things besides people that have addresses and such, there is a superentity for person-like types. But when we know we're referring to an actual person, we refer to their record in pr_person.
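
The superentity idea described above can be sketched in plain SQL. This is only an illustration using Python's sqlite3 module, not Eden's own model code: the specialized table names (msg_twitter, msg_sms), the note table and the add_message() helper are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Shared "parent" table: one row per message of any kind
CREATE TABLE msg_message (
    message_id    INTEGER PRIMARY KEY,
    instance_type TEXT NOT NULL   -- which specialized table holds the details
);
-- Specialized tables each point at their parent row
CREATE TABLE msg_twitter (id INTEGER PRIMARY KEY,
                          message_id INTEGER REFERENCES msg_message(message_id),
                          body TEXT);
CREATE TABLE msg_sms     (id INTEGER PRIMARY KEY,
                          message_id INTEGER REFERENCES msg_message(message_id),
                          body TEXT);
-- Any other table needs just one foreign key to refer to either kind
CREATE TABLE note (id INTEGER PRIMARY KEY,
                   message_id INTEGER REFERENCES msg_message(message_id),
                   text TEXT);
""")

def add_message(conn, table, body):
    """Insert a specialized record together with its parent record."""
    parent = conn.execute(
        "INSERT INTO msg_message (instance_type) VALUES (?)", (table,)).lastrowid
    # The table name comes from our own code here, never from user input
    conn.execute(f"INSERT INTO {table} (message_id, body) VALUES (?, ?)",
                 (parent, body))
    return parent

tweet_id = add_message(conn, "msg_twitter", "Power lines down")
sms_id = add_message(conn, "msg_sms", "Need shelter")
conn.executemany("INSERT INTO note (message_id, text) VALUES (?, ?)",
                 [(tweet_id, "urgent"), (sms_id, "follow up")])

# Both notes resolve through the same foreign key, whatever the message kind
kinds = [row[0] for row in conn.execute(
    """SELECT m.instance_type FROM note n
       JOIN msg_message m ON m.message_id = n.message_id
       ORDER BY n.id""")]
```

Our task table plays the role of the note table here: it links to the parent record and does not need to know which channel the message came from.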

Fill in required "new module" boilerplate

Look at the lesson on "making a new module" in the Eden book:
http://booki.flossmanuals.net/sahana-eden/_draft/_v/1.0/building-a-new-application/

That lesson puts the model file in the eden/models directory, but only to keep the lesson simple. Models in eden/models are loaded on every HTTP request, whether they're needed or not. Most Eden models are in eden/modules/s3db, and are only loaded by HTTP requests that need them. Since our message processing won't be used by most types of requests, we want it in eden/modules/s3db.

Add the new module to the list of enabled modules. This is normally specified in a "template" that has the customizations for a particular site. Here, we will "cheat" and just add the new module to our configuration file eden/models/000_config.py. Get the default module list from eden/private/templates/default/config.py

Copy it to models/000_config.py and add an entry for the hit module.
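
As a rough sketch, the added entry might look like the following, assuming the module list is an ordered mapping from module name to a small settings object, as in the default template. Storage and T are simplified stand-ins defined here so the snippet runs by itself (in the real config file they are already available), and the attribute values chosen for "hit" are illustrative guesses to be checked against the entries in the default config.py.

```python
from collections import OrderedDict

class Storage(dict):
    """Stand-in for web2py's Storage: a dict that allows attribute access."""
    __getattr__ = dict.get
    __setattr__ = dict.__setitem__

def T(text):
    """Stand-in for the translator T(); returns the string unchanged."""
    return text

# Pretend this was copied from the default template's module list
modules = OrderedDict([
    ("default", Storage(name_nice=T("Home"), restricted=False, module_type=None)),
    ("msg", Storage(name_nice=T("Messaging"), restricted=True, module_type=None)),
])

# New entry for our module; the values are assumptions for illustration
modules["hit"] = Storage(
    name_nice=T("Human Intelligence Tasks"),
    restricted=True,
    module_type=10,
)
```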

Add a database table for message processing tasks

We want to add a table that holds the data entered by one worker for one message. The table will need fields for:

A foreign key reference to the msg_message table. Here is another table with such a reference:
https://github.com/flavour/eden/blob/master/modules/s3db/msg.py#L1609
A category that the worker will assign. This can be just a text field for now. Here's an example of a text field (the from address in a message):
https://github.com/flavour/eden/blob/master/modules/s3db/msg.py#L98
(The category will be empty until the worker fills it in, so we can't require that it be non-empty.)
A location that the worker will enter by filling out a form and/or clicking on a map. There is a standard widget for selecting locations that will be included automatically if the location is specified as in this example:
https://github.com/flavour/eden/blob/master/modules/s3db/cr.py#L219
Here is where the function that generates the foreign key reference is defined:
https://github.com/flavour/eden/blob/master/modules/s3db/gis.py#L265
The name for the function used outside the gis module includes the "gis_" prefix to avoid ambiguity.

Why do we want a separate table? Why not just add a category and location to the msg_message table? Eventually, we want to do "human intelligence task" processing, in which results are verified by sending the same task to multiple workers, and comparing the results. So we may have more than one set of results for each message. We want to include which worker did each task, so we can check the quality of their work and refer them to more training if needed.
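
To make the shape of the table concrete, here is a sketch in plain SQL via Python's sqlite3 module. It is not Eden model code: the names hit_task, message_id, person_id and location_id, the cut-down stand-ins for msg_message, pr_person and gis_location, and the agreed_category() helper are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces REFERENCES constraints when this is switched on
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Cut-down stand-ins for the existing Eden tables
CREATE TABLE msg_message  (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE pr_person    (id INTEGER PRIMARY KEY, first_name TEXT);
CREATE TABLE gis_location (id INTEGER PRIMARY KEY, name TEXT, lat REAL, lon REAL);

-- One row per (worker, message) pair
CREATE TABLE hit_task (
    id          INTEGER PRIMARY KEY,
    message_id  INTEGER NOT NULL REFERENCES msg_message(id),
    person_id   INTEGER NOT NULL REFERENCES pr_person(id),
    category    TEXT,                                 -- empty until the worker fills it in
    location_id INTEGER REFERENCES gis_location(id)   -- empty until geocoded
);
""")

conn.execute("INSERT INTO msg_message (id, body) VALUES (1, 'Bridge closed at Main St')")
conn.executemany("INSERT INTO pr_person (id, first_name) VALUES (?, ?)",
                 [(1, "Ana"), (2, "Ben"), (3, "Cy")])
# Three workers each hold their own result for the same message
conn.executemany(
    "INSERT INTO hit_task (message_id, person_id, category) VALUES (?, ?, ?)",
    [(1, 1, "road"), (1, 2, "road"), (1, 3, "bridge")])

def agreed_category(conn, message_id, minimum=2):
    """Return the most-chosen category if at least `minimum` workers chose it.

    Ties are not handled specially; this is only a sketch.
    """
    row = conn.execute(
        """SELECT category, COUNT(*) AS votes FROM hit_task
           WHERE message_id = ? AND category IS NOT NULL
           GROUP BY category ORDER BY votes DESC LIMIT 1""",
        (message_id,)).fetchone()
    return row[0] if row and row[1] >= minimum else None

consensus = agreed_category(conn, 1)
```

Keeping one row per worker is what makes the comparison possible: with the category stored on msg_message itself there would be only one value and nothing to compare.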

Track which messages have been processed

We don't want to add fields to msg_message just for our module. But we need a way to tell when a message has been processed, so we can select unprocessed messages to give to users. Later, when we add more human intelligence task features such as sending the same task to multiple workers, we'll need somewhere to record how complete the work is for one message.

So we may want to add a table that gets a record added when each message intended for processing arrives. The table should be defined in the same model file as the task table. It should refer to the msg_message record in the same way as above, and should have a boolean field for whether the message is processed. (Later, this can be changed to support processing by multiple workers, but for now, we can consider it done when one worker has processed it.)
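
A minimal sketch of such a tracking table, again in plain SQL through sqlite3 rather than Eden's model API. The table name hit_status and the three helper functions are hypothetical names chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE msg_message (id INTEGER PRIMARY KEY, body TEXT);  -- stand-in
CREATE TABLE hit_status (
    id         INTEGER PRIMARY KEY,
    message_id INTEGER NOT NULL UNIQUE REFERENCES msg_message(id),
    processed  INTEGER NOT NULL DEFAULT 0   -- boolean: 0 = waiting, 1 = done
);
""")

def register_message(conn, message_id):
    """Called when a message intended for workers arrives."""
    conn.execute("INSERT INTO hit_status (message_id) VALUES (?)", (message_id,))

def next_unprocessed(conn):
    """Return the id of the oldest unprocessed message, or None."""
    row = conn.execute(
        "SELECT message_id FROM hit_status "
        "WHERE processed = 0 ORDER BY id LIMIT 1").fetchone()
    return row[0] if row else None

def mark_processed(conn, message_id):
    """For now, one finished worker is enough to call the message done."""
    conn.execute("UPDATE hit_status SET processed = 1 WHERE message_id = ?",
                 (message_id,))

for message_id, body in [(1, "Flooding on 5th Ave"), (2, "Need water at the school")]:
    conn.execute("INSERT INTO msg_message (id, body) VALUES (?, ?)",
                 (message_id, body))
    register_message(conn, message_id)

first = next_unprocessed(conn)
mark_processed(conn, first)
second = next_unprocessed(conn)
```

When multiple workers per message are supported later, the boolean could become a count of completed tasks without changing how the other tables refer to it.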

We don't want to enter all messages, just ones for our workers (e.g. incoming email for individuals should not be sent to workers). We'll need a way to recognize our messages, e.g. a Twitter direct message recipient or hashtag, or a particular SMS shortcode. There is a new and somewhat experimental feature for selecting out messages, which may be useful for this, but documentation is lacking. This will require consultation on IRC. If we cannot use this, we can (temporarily) add specialized code in the msg module's incoming message handling to pick out the desired messages and create records for them.
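
If we end up writing the temporary specialized code, the recognition step could be as simple as the sketch below. The hashtag, shortcode, account name, channel labels and the is_for_workers() function are all hypothetical, and this is not the experimental selection feature mentioned above.

```python
import re

HASHTAGS = {"#ghc13help"}                # hypothetical hashtag
SHORTCODES = {"12345"}                   # hypothetical SMS shortcode
TWITTER_ACCOUNTS = {"@example_relief"}   # hypothetical direct-message recipient

def is_for_workers(channel, recipient, body):
    """Return True if an incoming message should be queued for workers."""
    if channel == "sms":
        return recipient in SHORTCODES
    if channel == "twitter":
        if recipient is not None and recipient.lower() in TWITTER_ACCOUNTS:
            return True
        # Compare hashtags case-insensitively
        tags = {tag.lower() for tag in re.findall(r"#\w+", body)}
        return bool(tags & HASHTAGS)
    # Email and other channels are not sent to workers in this sketch
    return False
```

For example, is_for_workers("twitter", None, "Road blocked #GHC13Help") is true, while an email containing the same hashtag is ignored.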

Add a controller function to generate task pages for workers

Look at:

Other controllers in eden/controllers
The documentation for the controller helper function:
http://eden.sahanafoundation.org/wiki/S3/S3REST/s3_rest_controller
The documentation for a custom controller:
http://eden.sahanafoundation.org/wiki/S3/S3Method

Eden will automatically generate pages that correspond to database tables (list forms) or individual records (read or edit forms), or empty forms for adding new records (create forms). However, when a worker requests a task, there is not yet any database record for the task. Instead, we want to:

Select a new, not-yet-processed message.
Create a record in the (new) hit_task table (being worked on by the model team).
Return the standard form for the new record to the user.
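
The first two steps above can be sketched outside Eden as follows. This uses sqlite3 and a hypothetical request_task() function. The table and column names are assumptions that would need to match whatever the model team settles on, and the final step (returning the form) is left to the real controller.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hit_status (message_id INTEGER PRIMARY KEY,
                         processed  INTEGER NOT NULL DEFAULT 0);
CREATE TABLE hit_task (
    id          INTEGER PRIMARY KEY,
    message_id  INTEGER NOT NULL,
    person_id   INTEGER NOT NULL,
    category    TEXT,
    location_id INTEGER
);
INSERT INTO hit_status (message_id) VALUES (10), (11);
""")

def request_task(conn, person_id):
    """Pick an unprocessed message, create a blank task for it, return the task id.

    Returns None when nothing is waiting for this worker. A real controller
    would go on to return the edit form for the new record.
    """
    row = conn.execute(
        """SELECT s.message_id FROM hit_status s
           WHERE s.processed = 0
             AND NOT EXISTS (SELECT 1 FROM hit_task t
                             WHERE t.message_id = s.message_id
                               AND t.person_id = ?)
           ORDER BY s.message_id LIMIT 1""",
        (person_id,)).fetchone()
    if row is None:
        return None
    cursor = conn.execute(
        "INSERT INTO hit_task (message_id, person_id) VALUES (?, ?)",
        (row[0], person_id))
    return cursor.lastrowid

task_a = request_task(conn, person_id=1)
task_b = request_task(conn, person_id=1)
```

The NOT EXISTS clause keeps a worker from being handed the same message twice, which matters once the same message is sent to several workers.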

Add a view that presents a task to the worker and submits their work

The worker will get an autogenerated "edit" form, produced by the controller and view code together. If a plain edit form is OK, then this may be a very short task -- it may be that no special view file is needed. So this can probably be left until after the controller is done.

Generate a list of categories from the database

We would like to encourage workers to use existing categories when there is a close enough match, but let them add new ones if not. So we want to give the worker a menu of all the categories currently found in the category field (the field being added by the team working on the new tables), plus a way to add a new category.
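
One way to build that menu is to collect the distinct values already stored in the category field. The sketch below does this with sqlite3 and a hypothetical category_options() function; the hit_task table name and the decision to ignore case and surrounding spaces when de-duplicating are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hit_task (id INTEGER PRIMARY KEY, category TEXT)")
conn.executemany(
    "INSERT INTO hit_task (category) VALUES (?)",
    [("Medical",), ("shelter",), ("medical ",), (None,), ("",), ("Water",)])

def category_options(conn):
    """Return the categories already in use, cleaned up and sorted."""
    rows = conn.execute(
        "SELECT category FROM hit_task WHERE category IS NOT NULL ORDER BY id")
    seen = {}
    for (raw,) in rows:
        label = raw.strip()
        if label:
            # Case-insensitive de-duplication; keep the earliest spelling
            seen.setdefault(label.lower(), label)
    return [seen[key] for key in sorted(seen)]

options = category_options(conn)
```

The form would offer these as choices alongside a free-text box, so a new category typed by one worker shows up in the menu for the next.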