Full Text Search

Name: Vishrut Mehta
Mentor: Pat Tressel

Description of Work Done

Pylucene
- Installation details are here: http://eden.sahanafoundation.org/wiki/BluePrint/TextSearch#Pylucene
Apache Solr
- Installation details:
  - Solr distribution
  - Installation
  - Tutorial
  - Prerequisites:
    - Java 1.6 or later
  - Steps:
    - Unzip the distribution.
    - For now you can use the setup in the example directory, although it contains sample data that we won't want.
    - cd example
    - java -jar start.jar
    - Open in a browser: http://localhost:8983/solr/
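
Once start.jar is running, a quick way to confirm that Solr answers is to call its ping handler from Python. The following is a minimal sketch, not part of the Eden code. It assumes the example core is named collection1 and that the example configuration exposes the ping handler at admin/ping; the helper names ping_url and solr_is_up are made up for this illustration. It is written for Python 3.

```python
import json
from http.client import HTTPException
from urllib.parse import urljoin
from urllib.request import urlopen


def ping_url(base_url, core="collection1"):
    """Build the ping URL for one Solr core (core name and path are assumptions)."""
    if not base_url.endswith("/"):
        base_url += "/"
    return urljoin(base_url, core + "/admin/ping?wt=json")


def solr_is_up(base_url, core="collection1", timeout=5):
    """Return True if the core answers the ping request with status OK."""
    try:
        with urlopen(ping_url(base_url, core), timeout=timeout) as response:
            payload = json.loads(response.read().decode("utf-8"))
    except (OSError, HTTPException, ValueError):
        # OSError covers refused connections and timeouts,
        # ValueError covers a reply that is not valid JSON
        return False
    return payload.get("status") == "OK"
```

For example, solr_is_up("http://localhost:8983/solr/") should return True while the example server is running and False after it has been stopped.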

Sunburnt
- The script attached below installs the dependencies, then installs and configures Apache Solr and Sunburnt.

Solr Configuration and Schema changes

I have attached an installation script that installs all the dependencies and applies the Solr configuration.
You can also install everything manually.

The following dependencies need to be installed:

Antiword

wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
tar xvzf antiword-0.37.tar.gz
cd antiword-0.37
make

Pdfminer

wget http://pypi.python.org/packages/source/p/pdfminer/pdfminer-20110515.tar.gz
tar xvzf pdfminer-20110515.tar.gz
cd pdfminer-20110515
python setup.py install

Pyth

wget http://pypi.python.org/packages/source/p/pyth/pyth-0.5.6.tar.gz
tar xvzf pyth-0.5.6.tar.gz
cd pyth-0.5.6
python setup.py install

Httplib2
```
apt-get install python-httplib2
```

xlrd

wget http://pypi.python.org/packages/source/x/xlrd/xlrd-0.9.2.tar.gz
tar xvzf xlrd-0.9.2.tar.gz
cd xlrd-0.9.2
python setup.py install

libxml2, lxml > 3.0

apt-get install python-pip
apt-get install libxml2 libxslt-dev libxml2-dev
pip install lxml==3.0.2

After installing all these dependencies, we need to change the Solr config file, solr-4.3.x/example/solr/collection1/conf/solrconfig.xml
- Change the directory path to where you want to store the indexes. They can be stored anywhere, but it is better to keep them in the eden directory. For example, as in the installation script:
```
<dataDir>/home/<user>/web2py/applications/eden/indices</dataDir>
```
You can change this path to suit your setup; for example, if Solr runs on another machine, the directory path will be different.
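
If you would rather script this change than edit the file by hand, the sketch below sets the dataDir element with Python's standard ElementTree module. It is an illustration only and not taken from the installation script: set_data_dir is a hypothetical helper, and it assumes dataDir is a direct child of the root element of solrconfig.xml. Note that ElementTree drops XML comments when it re-serialises the file, so keep a backup of the original.

```python
import xml.etree.ElementTree as ET


def set_data_dir(solrconfig_xml, index_dir):
    """Return the solrconfig.xml text with <dataDir> pointing at index_dir."""
    # Parsing this way discards XML comments from the original file
    root = ET.fromstring(solrconfig_xml)
    data_dir = root.find("dataDir")
    if data_dir is None:
        # Add the element if the config does not define one yet
        data_dir = ET.SubElement(root, "dataDir")
    data_dir.text = index_dir
    return ET.tostring(root, encoding="unicode")
```

Read the file, pass its text and your index directory to set_data_dir(), and write the result back; Solr has to be restarted to pick up the new location.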

Next, the schema changes in the file solr-4.3.x/example/solr/collection1/conf/schema.xml.

Add the following inside the <fields>..</fields> tag:

<fields>
.
.

<field name="tablename" type="text_general" indexed="true" stored="true"/>
<field name="filetype" type="text_general" indexed="true" stored="true"/>
<field name="filename" type="text_general" indexed="true" stored="true"/>

.
.
</fields>

Then, after the closing </fields> tag, add the following <copyField> entries:

<copyField source="filetype" dest="text"/>
<copyField source="tablename" dest="text"/>
<copyField source="filename" dest="text"/>

These are the configuration changes and dependencies required to install Solr and Sunburnt and integrate them with Eden.
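
The schema changes above are easy to get wrong by a typo, so a small check can help before restarting Solr. The sketch below parses schema.xml and reports which of the three fields and copyField entries are still missing. It is only an illustration: missing_schema_entries is a hypothetical helper, not part of Eden or the installation script.

```python
import xml.etree.ElementTree as ET

# The three fields added to schema.xml in this section
REQUIRED = ("tablename", "filetype", "filename")


def missing_schema_entries(schema_xml):
    """Return the required field and copyField names absent from schema_xml."""
    root = ET.fromstring(schema_xml)
    fields = {f.get("name") for f in root.iter("field")}
    copies = {c.get("source") for c in root.iter("copyField")
              if c.get("dest") == "text"}
    return {
        "field": [name for name in REQUIRED if name not in fields],
        "copyField": [name for name in REQUIRED if name not in copies],
    }
```

Both lists are empty when the schema contains everything described above.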

Enabling text search:

Uncomment the following line in models/000_config.py:

# Uncomment this and set the solr url to connect to solr server for Full-Text Search
settings.base.solr_url = "http://127.0.0.1:8983/solr/"

Specify the appropriate IP address; here it is 127.0.0.1.
If Solr runs on a different machine, specify that machine's IP address instead.

Asynchronously Indexing and Deleting Documents:

The code for asynchronously indexing documents is in models/tasks.py.
Insertion: The document is first inserted into the database. Then the onaccept callback indexes it by scheduling the document_create_index() function from models/tasks.py. The code below should be added to enable Full Text Search for documents in any module. For an example, see the document_onaccept() hook in modules/s3db/doc.py.
Deletion: The record is first deleted from the database table, then the corresponding file is removed from Solr by deleting its index on the Solr server. See the document_ondelete() hook in modules/s3db/doc.py.

In model()


        if settings.get_base_solr_url():
            onaccept = self.document_onaccept # where document_onaccept is the onaccept hook for indexing
            ondelete = self.document_ondelete # where document_ondelete is the ondelete hook for deleting the index
        else:
            onaccept = None
            ondelete = None

        configure(tablename,
                  onaccept=onaccept,
                  ondelete=ondelete,
        .....

In onaccept()

    @staticmethod
    def document_onaccept(form):

        vars = form.vars
        doc = vars.file # where file is the name of the upload field

        table = current.db.doc_document # doc_document is the tablename
        try:
            name = table.file.retrieve(doc)[0]
        except TypeError:
            name = doc

        document = json.dumps(dict(filename=doc,
                                  name=name,
                                  id=vars.id,
                                  tablename="doc_document", # where "doc_document" is the name of the database table
                                  ))

        current.s3task.async("document_create_index",
                             args = [document])

        return

In ondelete()

    @staticmethod
    def document_ondelete(row):

        db = current.db
        table = db.doc_document # doc_document is the tablename

        record = db(table.id == row.id).select(table.file, # where file is the name of the upload field
                                               limitby=(0, 1)).first()

        document = json.dumps(dict(filename=record.file, # where file is the name of the upload field
                                  id=row.id,
                                 ))

        current.s3task.async("document_delete_index",
                             args = [document])

        return
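
Both hooks hand the task a JSON string, so the task's first step is to decode it and decide how to read the file. The sketch below shows that round trip. It is not the code from models/tasks.py: build_payload, read_payload and the EXTRACTORS table are hypothetical, and the mapping from file extension to extraction tool is an assumption based on the dependencies listed in the installation section (antiword for Word documents, pdfminer for PDF, pyth for RTF, xlrd for Excel).

```python
import json
import os

# Assumed mapping from file extension to the dependency that can read it;
# the real task may dispatch differently
EXTRACTORS = {
    ".doc": "antiword",
    ".pdf": "pdfminer",
    ".rtf": "pyth",
    ".xls": "xlrd",
}


def build_payload(filename, record_id, name=None, tablename=None):
    """Encode the task arguments the way the hooks above do."""
    data = dict(filename=filename, id=record_id)
    if name is not None:
        data["name"] = name
    if tablename is not None:
        data["tablename"] = tablename
    return json.dumps(data)


def read_payload(document):
    """Decode a payload and pick an extractor from the file extension."""
    data = json.loads(document)
    # The delete payload carries no "name", so fall back to the stored filename
    source = data.get("name", data["filename"])
    extension = os.path.splitext(source)[1].lower()
    data["extractor"] = EXTRACTORS.get(extension)  # None if unsupported
    return data
```

Passing a single JSON string keeps the task arguments serialisable, which is why the hooks call json.dumps() before scheduling the task.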

Full Text Functionality

The full text search functionality is integrated in modules/s3/s3resource.py; the fulltext() function does the work.
The process is:

First, a filter (a new submodule for TextFilter, called FullTextFilter) generates the query with the TEXT operator.
Then the records are extracted by a select() call with the transform=True parameter.
In select(), we call resource.get_query -> rfilter.get_query() -> transform()
In rfilter.get_query, the TEXT query is transformed into a BELONGS S3ResourceQuery using the transform() and fulltext() functions.
Note: fulltext() retrieves the record ids through Lucene by passing it the text that the user entered into the filter.
After generating the BELONGS S3ResourceQuery with the set of ids, we convert it to a DAL Query.
This query is returned to select(), which makes a db call to extract the records that match it.
Finally, we display the records with an extra field showing the document file field name. Permissions need no special handling: other queries such as like, contains and belongs enforce permissions by filtering the ids, and because we generate a BELONGS query from the TEXT query, the same permission management applies here. The fulltext() function is only an intermediate step.
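
The steps above can be illustrated with a toy model. The sketch below is not the S3 code: queries are plain tuples instead of S3ResourceQuery objects, the index is a dictionary instead of Solr, and the function bodies are assumptions that only mirror the roles of fulltext(), transform() and select() described here. It shows why permissions are unaffected: the text lookup only produces ids, and the usual id filtering still runs afterwards.

```python
def fulltext(text, index):
    """Stand-in for the Solr/Lucene lookup: ids of documents containing all terms."""
    terms = text.lower().split()
    return sorted(record_id for record_id, body in index.items()
                  if all(term in body.lower() for term in terms))


def transform(query, index):
    """Rewrite a ("TEXT", field, text) query as ("BELONGS", "id", ids)."""
    operator, field, value = query
    if operator != "TEXT":
        return query  # other operators pass through unchanged
    return ("BELONGS", "id", fulltext(value, index))


def select(records, query, permitted_ids):
    """Return the permitted records matched by a BELONGS query."""
    operator, field, ids = query
    if operator != "BELONGS":
        raise ValueError("select() expects an already transformed query")
    # Permission handling is the same id filtering as for any other query
    wanted = set(ids) & set(permitted_ids)
    return [record for record in records if record[field] in wanted]
```

With an index of three documents of which two mention "flood", transform() yields a BELONGS query over those two ids, and select() then returns only the ones the user is allowed to read.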

Note: The structure for extracting snippets from the query and displaying them along with the records still has to be designed (a @ToDo).

Unit Test

Unit tests are an important part of any code, checking that it works as expected.
For Full Text Search, I have implemented them in modules/unit_tests/s3/s3resource.py, in the class DocumentFullTextSearchTests.
To run the unit tests, type in your web2py folder:

python web2py.py -S eden -M -R applications/eden/modules/unit_tests/s3/s3resource.py

To check the code in different situations, several tests are implemented. The cases are:
- When Solr is available (Solr server is running normally)
- When Solr is unavailable (Solr server is not running, but enabled)
- Checking the output of query() function
- Checking the output of call() function
- Checking the output of select() when transform=True and Solr is available
- Checking the output of select() when transform=True and Solr is unavailable
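
The available and unavailable cases can be tested without a running server by substituting a fake backend. The sketch below shows the pattern with unittest. It is not the DocumentFullTextSearchTests class: FakeSolr and lookup_ids are hypothetical, and returning None when Solr cannot be reached is an assumption made for this example, not a statement about what Eden does in that case.

```python
import unittest


class FakeSolr(object):
    """Test double that either returns fixed ids or behaves like a dead server."""

    def __init__(self, ids=(), available=True):
        self.ids = list(ids)
        self.available = available

    def search(self, text):
        if not self.available:
            raise IOError("Solr is not reachable")
        return list(self.ids)


def lookup_ids(backend, text):
    """Return the matching ids, or None if the backend is unreachable (assumed)."""
    try:
        return backend.search(text)
    except IOError:
        return None


class FullTextLookupTests(unittest.TestCase):

    def test_solr_available(self):
        self.assertEqual(lookup_ids(FakeSolr([1, 3]), "flood"), [1, 3])

    def test_solr_unavailable(self):
        self.assertIsNone(lookup_ids(FakeSolr(available=False), "flood"))


def run_tests():
    """Run the two cases and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FullTextLookupTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() runs both cases; the returned result reports success through its wasSuccessful() method.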

Last modified on 08/16/13 21:51:36
