wiki:BluePrint/TextSearch

Version 27 (modified by Vishrut Mehta, 12 years ago) ( diff )

--

BluePrint: Text Search

Introduction

The Blueprint outlines the development of two functionalities for Sahana Eden. They are:

Full-text Search

  • It will provides users to search for text in uploaded documents.

Global Search

  • It will provide users to search for text over multiple resources (eg. Organization, Hospital, etc.).

Stakeholders

  • The End Users

Agencies like:

  • TLDRMP : Timor Leste Disaster Risk Management Portal
  • CSN : Community Stakeholder Network
  • LAC: Los Angeles County
  • IFRC : International Federation of Red Cross and Red Crescent Societies

User Stories

For Full-Text Search:

  • The user will have to go to Document module and Search option. Then he has to type the text(or the specified query format) to search in all the uploaded documents.
  • The result will be the name on the document, the link to the uploaded documents(specifying the authorization checks) and a line of text containing the search string.

Requirements

Functional

For Full Text Search:

  • Proper understanding and the work model of S3Search(depricated) and S3Filter is required.
  • Literature study of Apache Lucene and Pylucene. Getting familiar with Pylucene and deploy it into my local machine.
  • Studying the linkage of the Lucene daemon and web2py server.
  • Extend the functionality of S3Filter by introducing an addition feature (which is a text field) to search for text through documents.
  • A user interface for displaying the search result.

Non-functional

Standards

System Constraints

  • The user should have Pylucene installed in there machine.
  • Also, while starting the web2py server, the Lucence deamon should also start.
  • In case of failure, the search query related to full-text search will not be functional.

Use-Cases

Design

Workflows

For Full-Text Search:

  • The workflow will be according to the use case described above. First the user should need to analyze whether he wants to search through all the uploaded documents or any resource specific document.

Unrelated Search:

  • The user should go the the Document -> Search.
  • Type the following query(string) he wants to search for.
  • The output will be the response of the Lucene Daemon running in the background.
  • The response will be displayed in the form as the search result for that query.

Related Search:

  • This type of search is to search for text in uploaded documents over multiple resources.
  • Go to that particular resource(eg. Hospital, Organization, etc) -> Search.
  • You have to type the particular resource in which you want to search(normal search) and also string in the text filter for document search.
  • So, it will search on the basis of the resource associated to the document and the search string for the document search.(So we will limit our search to only those documents for which a particular resource is associated.)
  • The response will be recorded and displayed as an output. Here there will be two categories of output. The first one would be the normal output of the search filter form and after that we would show the search results for the output of the full-text based query (the display/UI would be same as for unrelated search).

Site Map

Wireframes

todo

Technologies

The technology going to be used are:

  • S3Filter

The resource for these are here:
http://eden.sahanafoundation.org/wiki/S3/FilterForms

  • Apache Lucene and Pylucene

A comparison analysis was done here whether to choose between Apache Lucene or Apache Solr.
http://www.alliancetek.com/Blog/post/2011/09/22/Solr-Vs-Lucene-e28093-Which-Full-Text-Search-Solution-Should-You-Use.aspx
Solr is a platform that uses the Lucene library, the only time it may be preferable to use Lucene is if you want to embed search functionality into your own application. So I choose Lucene for indexing the documents and search string in those documents.
Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.
Refer this for more information about its functtionalities:
http://lucene.apache.org/core/

Implementation

  • It consists of extending the usage of S3Filter to document search by creating new TextFilter field in the document search form as well as all other resources.
  • When a user upload its document, it is indexed using the Lucene deamon, which will be running at background.
  • As and when new document is uploaded or edited, it will be indexed, so as to search in it efficiently. Lucene provides a library which does its indexing and stuff efficiently.
  • When a user enters a query, a request will be sent to the deamon and the deamon will search through the indexed documents and give the output search results.
  • There is also Full-text search over different resources, which would need the resources in which the user wants to search for.
  • This would be accomplished by using Pylucene, which is a wrapper on Apache Lucene in Python to carry out these tasks.
  • After the response, the part which remains will be displaying the search results in a proper user friendly format.


Future Implementation:

  • UI is a secondary concern for how to display the search result. We could take inspiration from the Google and Bing! Search results for an attractive UI format.

References

Chats and Discussions

http://logs.sahanafoundation.org/sahana-eden/2013-03-24.txt
http://logs.sahanafoundation.org/sahana-eden/2013-03-24.txt
http://logs.sahanafoundation.org/sahana-eden/2013-04-20.txt

Online Resources

http://lucene.apache.org/core/4_2_1/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Boosting_a_Term
http://lucene.apache.org/pylucene/features.html
http://oak.cs.ucla.edu/cs144/projects/lucene/


BluePrint

Attachments (1)

Download all attachments as: .zip

Note: See TracWiki for help on using the wiki.