= Project: Full-Text Search =
Name : '''Vishrut Mehta'''[[BR]]
Mentor: '''Pat Tressel'''[[BR]]
=== Proposal ===
The proposal for the project is here:[[BR]]
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2013/vishrutmehta/6001
=== BluePrints ===
This project draws ideas from the Blueprints below:[[BR]]
* BluePrint/TextSearch
=== Meetings And Discussions ===
'''Weekly Meeting''' : '''Tuesday and Saturday 04:30 UTC'''
[[BR]]
'''Venue''' : '''IRC''' [[BR]]
Nick - vishrut009
[[BR]]
'''Google Group Discussions''' :
* https://groups.google.com/forum/#!topic/sahana-eden/E0S7Hl_hjWo [[BR]]
* https://groups.google.com/forum/#!topic/sahana-eden/9XwK4955cmg [[BR]]
* My mentor Pat and I used Trello to keep track of the work done so far:
* https://trello.com/b/LkHQycbZ/gsoc-13-search-project
=== Description of Work Done ===
* Pylucene
* Installation details are here: http://eden.sahanafoundation.org/wiki/BluePrint/TextSearch#Pylucene
* Apache Solr
* The installation details are here:
* [http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr distribution]
* [http://wiki.apache.org/solr/SolrInstall Installation]
* [http://lucene.apache.org/solr/4_3_0/tutorial.html Tutorial]
* Prerequisites:
* Java, at least 1.6
* Unzip the distribution.
* You can, for now, use the setup in the example directory, though this contains sample data that we won't want.
* {{{cd example}}}
* {{{java -jar start.jar}}}
* Open in browser: http://localhost:8983/solr/
* Sunburnt
* The script attached below installs the dependencies, then installs and configures Apache Solr and Sunburnt.[[BR]]
* Solr Configuration and Schema changes
* I have attached an installation script that installs all the dependencies and applies the Solr configuration.
* You can also install everything manually.
* The dependencies you need to install are:
* Antiword
{{{
wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
tar xvzf antiword-0.37.tar.gz
cd antiword-0.37
make
}}}
* Pdfminer
{{{
wget http://pypi.python.org/packages/source/p/pdfminer/pdfminer-20110515.tar.gz
tar xvzf pdfminer-20110515.tar.gz
cd pdfminer-20110515
python setup.py install
}}}
* Pyth
{{{
wget http://pypi.python.org/packages/source/p/pyth/pyth-0.5.6.tar.gz
tar xvzf pyth-0.5.6.tar.gz
cd pyth-0.5.6
python setup.py install
}}}
* Httplib2
{{{
apt-get install python-httplib2
}}}
* xlrd
{{{
wget http://pypi.python.org/packages/source/x/xlrd/xlrd-0.9.2.tar.gz
tar xvzf xlrd-0.9.2.tar.gz
cd xlrd-0.9.2
python setup.py install
}}}
* libxml2, lxml > 3.0
{{{
apt-get install python-pip
apt-get install libxml2 libxslt-dev libxml2-dev
pip install lxml==3.0.2
}}}
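Once the packages above are installed, a quick import check can confirm that Python sees them. This is only a sketch: the import names listed are assumed to match the packages above, and `missing_modules` is a hypothetical helper that is not part of Eden. Antiword is a command-line program rather than a Python module, so it is not covered here.

```python
import importlib

# Import names assumed to correspond to the packages installed above.
REQUIRED_MODULES = ["pdfminer", "pyth", "httplib2", "xlrd", "lxml"]


def missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except Exception:
            # ImportError is the usual case; a broken install can raise
            # other errors too, so treat any failure as "missing".
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules: " + ", ".join(missing))
    else:
        print("All Python dependencies are importable")
```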
* After installing all these dependencies, change the Solr config file, solr-4.3.x/example/solr/collection1/conf/solrconfig.xml
* Set the directory path where the indexes will be stored. Any location works, but keeping them inside the eden directory is preferable. For example, the installation script uses:
{{{
/home//web2py/applications/eden/indices
}}}
* Adjust this path to suit your setup; for example, if Solr runs on another machine, the directory path will be different.
* Next, the schema changes in the file solr-4.3.x/example/solr/collection1/conf/schema.xml:
* Add the following in the .. tag:
{{{
.
.
.
.
}}}
* After adding this, after the .. tag, add the following code for
{{{
}}}
* These are the configuration changes and dependencies required to install Solr and Sunburnt and integrate them with Eden.
* Enabling text search:
-> Uncomment the following line in ''models/000_config.py''
{{{
# Uncomment this and set the solr url to connect to solr server for Full-Text Search
settings.base.solr_url = "http://127.0.0.1:8983/solr/"
}}}
Set the IP address of the Solr server; the example uses 127.0.0.1.[[BR]]
If Solr runs on a different machine, specify that machine's IP instead.
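Before enabling the setting, it can help to confirm that the Solr server answers at the configured URL. The sketch below is an assumption-laden example, not Eden code: it assumes the example core is named `collection1` and that its `/admin/ping` handler is enabled, and `solr_ping_url` and `solr_is_available` are hypothetical helper names.

```python
import json
from urllib.request import urlopen


def solr_ping_url(base_url, core="collection1"):
    """Build the ping URL for a core (core name is an assumption)."""
    return "%s/%s/admin/ping?wt=json" % (base_url.rstrip("/"), core)


def solr_is_available(base_url, fetch=None, timeout=5):
    """Return True if the ping handler reports status OK.

    `fetch` takes a URL and returns the response body; it can be
    replaced with a stub so the check is testable without a server.
    """
    if fetch is None:
        def fetch(url):
            return urlopen(url, timeout=timeout).read()
    try:
        body = fetch(solr_ping_url(base_url))
        return json.loads(body).get("status") == "OK"
    except (OSError, ValueError, AttributeError):
        # Connection refused, timeout, or an unexpected response body
        return False


if __name__ == "__main__":
    print(solr_is_available("http://127.0.0.1:8983/solr/"))
```

Passing the fetch function as a parameter keeps the network call out of the logic, so the same helper can be exercised with canned responses.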
* Asynchronously Indexing and Deleting Documents:
* The code for asynchronously indexing documents is in ''models/tasks.py''
* '''Insertion''': The document is first inserted into the database. The ''onaccept'' callback then indexes it by calling the document_create_index() function from models/tasks.py. Add the code below to enable Full-Text Search for documents in any module. For a working example, see the document_onaccept() hook in modules/s3db/doc.py.
* '''Deletion''': The record is first deleted from the database table. The corresponding file is then removed from Solr by deleting its index entry on the Solr server. See the document_ondelete() hook in modules/s3db/doc.py.
* In model()
{{{
if settings.get_base_solr_url():
    onaccept = self.document_onaccept # document_onaccept is the onaccept hook for indexing
    ondelete = self.document_ondelete # document_ondelete is the ondelete hook for removing the index
else:
    onaccept = None
    ondelete = None
configure(tablename,
          onaccept=onaccept,
          ondelete=ondelete,
          .....
}}}
* In onaccept()
{{{
@staticmethod
def document_onaccept(form):
    # Requires "import json" at module level
    vars = form.vars
    doc = vars.file # where file is the name of the upload field
    table = current.db.doc_document # doc_document is the tablename
    try:
        # Recover the original filename of the upload
        name = table.file.retrieve(doc)[0]
    except TypeError:
        name = doc
    document = json.dumps(dict(filename=doc,
                               name=name,
                               id=vars.id,
                               tablename="doc_document", # where "doc_document" is the name of the database table
                               ))
    # Queue the indexing task so that it runs asynchronously
    current.s3task.async("document_create_index",
                         args=[document])
    return
}}}
* In ondelete():
{{{
@staticmethod
def document_ondelete(row):
    db = current.db
    table = db.doc_document # doc_document is the tablename
    record = db(table.id == row.id).select(table.file, # where file is the name of the upload field
                                           limitby=(0, 1)).first()
    document = json.dumps(dict(filename=record.file, # where file is the name of the upload field
                               id=row.id,
                               ))
    current.s3task.async("document_delete_index",
                         args=[document])
    return
}}}
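Both hooks above hand the task a single JSON string rather than a Python object, so the task side has to decode it again. What document_create_index() and document_delete_index() actually do is not shown on this page; the sketch below only illustrates the encode/decode round trip, and `build_index_payload` and `read_index_payload` are hypothetical names.

```python
import json


def build_index_payload(filename, record_id, name=None, tablename=None):
    """Serialise the fields the hooks above pass to the async task."""
    payload = dict(filename=filename, id=record_id)
    if name is not None:
        payload["name"] = name
    if tablename is not None:
        payload["tablename"] = tablename
    return json.dumps(payload)


def read_index_payload(document):
    """Decode the JSON string on the task side and check required keys."""
    data = json.loads(document)
    missing = [key for key in ("filename", "id") if key not in data]
    if missing:
        raise ValueError("payload is missing: " + ", ".join(missing))
    return data


if __name__ == "__main__":
    doc = build_index_payload("doc_document.file.abc123.pdf", 42,
                              name="report.pdf", tablename="doc_document")
    print(read_index_payload(doc)["id"])
```

Keeping the payload to plain strings and numbers means it survives serialisation unchanged, which is presumably why the hooks pass JSON instead of the record itself.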
=== Full Text Functionality ===
* The full-text search functionality is integrated in modules/s3/s3resource.py; the fulltext() function does the work.
* The flow is: the TEXT query first goes to the transform() function, which splits the query recursively and transforms each part.
* After the transformation, a query with the TEXT operator goes to the fulltext() function, which searches for the keywords in the indexed documents.
* It retrieves the matching document ids and converts them into a BELONGS S3ResourceQuery.
* The code for fulltext() is in ''modules/s3/s3resource.py''
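The flow described above can be illustrated with a toy model. This is not the code from s3resource.py: queries are represented as plain tuples instead of S3ResourceQuery objects, an in-memory dict stands in for the Solr index, and all function names and the tuple layout are assumptions made for the example.

```python
def search_document_ids(index, keywords):
    """Stand-in for the Solr lookup: ids of documents containing every keyword.

    `index` maps a record id to the extracted text of that document.
    """
    words = keywords.lower().split()
    return sorted(doc_id for doc_id, text in index.items()
                  if all(word in text.lower().split() for word in words))


def transform(query, index):
    """Recursively replace ("text", field, keywords) with a BELONGS on ids."""
    op = query[0]
    if op in ("and", "or"):
        # Split the compound query and transform both sides
        return (op, transform(query[1], index), transform(query[2], index))
    if op == "text":
        keywords = query[2]
        return ("belongs", "id", search_document_ids(index, keywords))
    # Any other operator is left unchanged
    return query


if __name__ == "__main__":
    index = {1: "flood relief report", 2: "shelter capacity report"}
    query = ("and", ("text", "file", "report"), ("eq", "name", "x"))
    print(transform(query, index))
```

Turning the search result into a BELONGS condition on record ids lets the rest of the query run against the database as usual.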
=== Unit Test ===
* Unit tests are an important part of any code, to check that it works as expected.
* For Full-Text Search, I have implemented them in ''modules/unit_tests/s3/s3resource.py'', in the class DocumentFullTextSearchTests
* To run the unit test, type in your web2py folder:
{{{python web2py.py -S eden -M -R applications/eden/modules/unit_tests/s3/s3resource.py}}}
* To check the code in different situations and catch errors, several test cases are implemented:
* When Solr is available (Solr server is running normally)
* When Solr is unavailable (Solr server is not running, but enabled)
* Checking the output of query() function
* Checking the output of call() function
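The first two cases above (Solr available and Solr unavailable) can be sketched with a stubbed search backend, so that no Solr server is needed. This is not the DocumentFullTextSearchTests class itself: `fulltext_ids` is a hypothetical stand-in for the code under test, and returning `None` when Solr cannot be reached is an assumption made for the example.

```python
import unittest


def fulltext_ids(search, keywords):
    """Return matching ids, or None if the search backend is unreachable."""
    try:
        return list(search(keywords))
    except ConnectionError:
        return None


class FullTextSketchTests(unittest.TestCase):

    def test_solr_available(self):
        # Stub backend that answers normally
        self.assertEqual(fulltext_ids(lambda keywords: [4, 7], "flood"), [4, 7])

    def test_solr_unavailable(self):
        # Stub backend that behaves like a stopped Solr server
        def solr_down(keywords):
            raise ConnectionError("Solr not reachable")
        self.assertIsNone(fulltext_ids(solr_down, "flood"))
```

Saved as a module, the sketch runs with `python -m unittest` followed by the module name.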
[[BR]][[BR]]
||= SMART Goal =||= Measure =||= Status =||
|| Explore Pylucene || Installed and configured on demo server || Completed ||
|| Scripts for indexing and search in pylucene || Scripts working on the demo server || Completed ||
|| Explore Apache Solr and Sunburnt || Installed both on demo and local server || Completed ||
|| Scripts for indexing and search for sunburnt || Working scripts for sunburnt ready || Completed ||
|| Asynchronously Indexing and Deleting Document || Implemented & Integrated in Sahana Eden || Completed ||
|| Install Script for Installing and Configuring Solr and sunburnt || Below is the link of the script || Completed ||
|| Designing the Full-Text search functionality implementation || Discussed with Dominic and Fran || Completed ||
|| Implementation of fulltext() function in s3resource.py || Successfully implemented with Error handling || Completed ||
|| Implemented a transform() function to transform a TEXT query to a BELONGS query || Successfully implemented with Error handling || Completed ||
|| Unit tests for all cases (Solr available/unavailable, query(), call()) || Implemented the unit tests for s3resource || Completed ||
|| Generic Indexing onaccept() hook, can be integrated in any module || Implemented and tested on doc_document || Completed ||
|| Transform parameter in select() || Should be implemented in s3resource.py || Under Review ||
|| Unit Tests for transform parameter in select() || Specific test cases included to check || Under Review ||
|| Extract the snippet and return it to the transform function || Along with the document, a snippet of text should also be shown || ||
|| New TextFilter for FullText Search || To be implemented in the new search filter, s3filter.py || ||
|| Javascript for FullTextFilter || To be implemented in s3.filter.js || ||
|| Presentation/Display of records || Records for documents should be displayed || ||