wiki:BluePrint/TextSearch

Context Navigation

Version 44 (modified by Vishrut Mehta, 12 years ago) ( diff )
--

BluePrint: Text Search

Introduction

The Blueprint outlines the development of two functionalities for Sahana Eden. They are:

Full-text Search

It will provides users to search for text in uploaded documents.

Global Search

It will provide users to search for a particular query really quick over multiple resources together(eg. Organization, Hospital, etc.).

Stakeholders

The End Users - The group of end users are a heterogeneous mass and would like to search to any type of specific queries.

Site Administrator, Site Operators,

Agencies like:

TLDRMP : Timor Leste Disaster Risk Management Portal

CSN : Community Stakeholder Network

LAC: Los Angeles County

IFRC : International Federation of Red Cross and Red Crescent Societies

User Stories

For Full-Text Search:

For Unrelated Search

As a user, I want -

To be able to search through contents into ALL the documents as as a user I may not know the resource or relevant module to search into.

As a developer or an admin, I want -

expand and would like to test my search results and increase the efficiency of the search result.

For Related Search

As a user, I want -

To be able to search through contents into relevant resource as I would be somewhat familiar with the system and would know where to search for what. For eg, If i want to search for all the reports/docs related to Hospital, i will go to Hospital module and write a search query for a particular document.

As a developer, I would like to -

expand the current functionality and increase the efficiency of the search results.

For Global Search:

The user will type the query of the upper right corner of the the page, where the global search option will be there. So as a User -

I may want quick and rough results of every records of all resource, rather going to a particular resource and try to search in that. This will provide the user rough set of search list to the user of the different resources.

As an agency,

The current IFRC template design has this functionality, so as a part of this project, I will try to accomplish this, but will be a lower priority than the document search

Use Case Analysis

Target Problem

The Full-Text Search will try to solve two problem:

To find one particular item(document/record).

To find all items Relevant for a task the user is to perform.

The Global Search will solve the following problem:

To search for a particular item through all the Resources.

The Use Case for this is the IFRC UX design contains the global search query box, so we would add this functionality to these type of agencies.

Benefit to Sahana Eden

A new functionality for searching text into the documents will be implemented as a part of this project. Till now no such system exits. So when the user wants to search for a particular document/report, relevant text search should be able to give efficient search results.

For IFRC, the UX design contains a Global Search option on its upper right corner. This will not be an urgent priority(could be done in the third trimester).

Requirements

Functional

For Full Text Search:

Proper understanding and the work model of S3Filter is required.

Literature study of Apache Lucene and Pylucene. Getting familiar with Pylucene and deploy it into my local machine.

Studying the linkage of the Lucene daemon and web2py server.

Extend the functionality of S3Filter by introducing an addition feature (which is a text field) to search for text through documents.

A user interface for displaying the search result.

For Global Search

Appropriate understanding for the S3Filter, and we would use S3TextFilter after configuring the context links, so need to learn that also..

Need to search for a particular resource over all the resources.

Efficient search mechanism to search over all the resource.

Non-functional

Standards

System Constraints

The user should have Pylucene installed in there machine.

Also, while starting the web2py server, the Lucence deamon should also start.

In case of failure, the search query related to full-text search will not be functional.

Use-Cases

Design

Workflows

For Full-Text Search:

The workflow will be according to the use case described above. First the user should need to analyze whether he wants to search through all the uploaded documents or any resource specific document.

Unrelated Search:

The user should go the the Document -> Search.

For Simple Search: Tick the checkbox to search through uploaded documents.

For Advanced Search: Type the following query(string) he wants to search for.

The output will be the response of the Lucene Daemon running in the background.

The response will be displayed in the form as the search result for that query.

Related Search:

This type of search is to search for text in uploaded documents over multiple resources.

Go to that particular resource(eg. Hospital, Organization, etc) -> Search.

You have to type the particular resource in which you want to search(normal search) and also string in the text filter for document search.

For Simple Search: Tick the checkbox to search through uploaded documents in related resources.

For Advanced Search: It will search on the basis of the resource associated to the document and the search string for the document search.(So we will limit our search to only those documents for which a particular resource is associated.)

The response will be recorded and displayed as an output. Here there will be two categories of output. The first one would be the normal output of the search filter form and after that we would show the search results for the output of the full-text based query (the display/UI would be same as for unrelated search).

For Global Search:

The user want to search through all the resource, then he will go to the global search option on the upper right corner of IFRC template.

The user need to type the resource or text he wants to search for.

The back-end processing happens and efficiently filters the resources and displays the output.

The main task on how to categorize the output. We will divide the output form into category of resources in alphabetical order. For eg. Assets, CAP, GIS, Hospitals, etc.. So the category heading would be this resource and the output will be the corresponding records which match the search request.

Site Map

Wireframes

todo

Technologies

The technology going to be used are:

S3Filter

The resource for these are here:
http://eden.sahanafoundation.org/wiki/S3/FilterForms

Apache Lucene and Pylucene

A comparison analysis was done here whether to choose between Apache Lucene or Apache Solr.
http://www.alliancetek.com/Blog/post/2011/09/22/Solr-Vs-Lucene-e28093-Which-Full-Text-Search-Solution-Should-You-Use.aspx
Solr is a platform that uses the Lucene library, the only time it may be preferable to use Lucene is if you want to embed search functionality into your own application. So I choose Lucene for indexing the documents and search string in those documents.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Refer this for more information about its functtionalities:
http://lucene.apache.org/core/

Implementation

Full-Text Search

It consists of extending the usage of S3Filter to document search by creating new TextFilter field in the document search form as well as all other resources.

When a user upload its document, it is indexed using the Lucene deamon, which will be running at background.

As and when new document is uploaded or edited, it will be indexed, so as to search in it efficiently. Lucene provides a library which does its indexing and stuff efficiently.

When a user enters a query(For simple search, its a checkbox and normal query and for advanced search, it will be a input text field), a request will be sent to the deamon and the deamon will search through the indexed documents and give the output search results.

There is also Full-text search over different resources, which would need the resources in which the user wants to search for.

This would be accomplished by using Pylucene, which is a wrapper on Apache Lucene in Python to carry out these tasks.

The main thing we could focus on is the efficiency of the search, as we know it will be Computationally challenging to perform accurate search.

After the response, the part which remains will be displaying the search results in a proper user friendly format.

Future Implementation:

UI is a secondary concern for how to display the search result. We could take inspiration from the Google and Bing! Search results for an attractive UI format.

Global Search

We will look a the design implementation of IFRC template, and add the global search text box using S3Filter and S3TextFilter.

We would not search for a particular record over here, but would filter all widgets for common context. As Dominic guided me, the solution is generic - i.e. it can use any filter widget (text, options, date/time etc.) - but requires the configuration of a context link for every resource (e.g. how is the resource linked to the event).

For eg. If I configure a S3Profile page with a bunch of resources, introspect all text fields in the respective tables. After doing this, configure them as context links, and then add an S3TextFilter for this context.So thus, then this would let me search for text phrases through all these resources.

After doing this and filtering out the common context, we would need a UI to display it. It would be simple, user friendly and categorized. Showing it in a much better way would be for future implementation.

References

Chats and Discussions

http://logs.sahanafoundation.org/sahana-eden/2013-03-24.txt
http://logs.sahanafoundation.org/sahana-eden/2013-03-24.txt
http://logs.sahanafoundation.org/sahana-eden/2013-04-20.txt

Online Resources

http://lucene.apache.org/core/4_2_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Boosting_a_Term
http://lucene.apache.org/pylucene/features.html
http://oak.cs.ucla.edu/cs144/projects/lucene/

BluePrint

Attachments (1)

search.png (32.4 KB ) - added by Vishrut Mehta 12 years ago.

Download all attachments as: .zip

Note: See TracWiki for help on using the wiki.

Download in other formats:

Plain Text

Context Navigation

BluePrint: Text Search

Table of Contents

Introduction

Full-text Search

Global Search

Stakeholders

User Stories

For Unrelated Search

For Related Search

Use Case Analysis

Target Problem

Benefit to Sahana Eden

Requirements

Functional

Non-functional

Standards

System Constraints

Use-Cases

Design

Workflows

For Full-Text Search:

For Global Search:

Site Map

Wireframes

Technologies

Implementation

Full-Text Search

Global Search

References

Chats and Discussions

Online Resources

Attachments (1)

Download in other formats: