Changes between Initial Version and Version 1 of FullTextSearch


Timestamp:
08/16/13 20:22:17 (12 years ago)
Author:
Vishrut Mehta

  • FullTextSearch

    v1 v1  
     1=== Description of Work Done ===
     2
     3* Pylucene
     4 * Installation details are here: http://eden.sahanafoundation.org/wiki/BluePrint/TextSearch#Pylucene
     5* Apache Solr
     6 * The installation details are here:
     7  * [http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr distribution]
     8  * [http://wiki.apache.org/solr/SolrInstall Installation]
     9  * [http://lucene.apache.org/solr/4_3_0/tutorial.html Tutorial]
     10  * Prerequisites and setup:
     11   * Java, at least 1.6
     12   * Unzip the distribution.
     13   * You can, for now, use the setup in the example directory, though this contains sample data that we won't want.
     14   * {{{cd example}}}
     15   * {{{java -jar start.jar}}}
     16   * Open in browser: http://localhost:8983/solr/
     17 
     18* Sunburnt
     19 * The script attached below installs the dependencies, and also installs and configures Apache Solr and Sunburnt.
     20
     21* Solr Configuration and Schema changes
     22 * I have attached a script that installs all the dependencies and applies the Solr configuration.
     23 * You can also install everything manually.
     24 * The dependencies you need to install are:
     25  * Antiword
     26{{{
     27wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
     28tar xvzf antiword-0.37.tar.gz
     29cd antiword-0.37
     30make
     31}}}
     32  * Pdfminer
     33{{{
     34wget http://pypi.python.org/packages/source/p/pdfminer/pdfminer-20110515.tar.gz
     35tar xvzf pdfminer-20110515.tar.gz
     36cd pdfminer-20110515
     37python setup.py install
     38}}}
     39  * Pyth
     40{{{
     41wget http://pypi.python.org/packages/source/p/pyth/pyth-0.5.6.tar.gz
     42tar xvzf pyth-0.5.6.tar.gz
     43cd pyth-0.5.6
     44python setup.py install
     45}}}
     46  * Httplib2
     47{{{
     48apt-get install python-httplib2
     49}}}
     50  * xlrd
     51{{{
     52wget http://pypi.python.org/packages/source/x/xlrd/xlrd-0.9.2.tar.gz
     53tar xvzf xlrd-0.9.2.tar.gz
     54cd xlrd-0.9.2
     55python setup.py install
     56}}}
     57  * libxml2, lxml > 3.0
     58{{{
     59apt-get install python-pip
     60apt-get install libxml2 libxslt-dev libxml2-dev
     61pip install lxml==3.0.2
     62}}}
     63 * After installing these dependencies, change the Solr config file, solr-4.3.x/example/solr/collection1/conf/solrconfig.xml:
     64  * Set the data directory to the path where the indexes should be stored. They can be stored anywhere, but it is better to keep them in the eden directory. For example, as in the installation script:
     65{{{
     66<dataDir>/home/<user>/web2py/applications/eden/indices</dataDir>
     67}}}
     68 * Adjust this path to suit your setup; for example, if Solr runs on another machine, the directory path will be different.
     69 * Next, make the following schema changes in solr-4.3.x/example/solr/collection1/conf/schema.xml:
     70  * Add the following inside the <fields>..</fields> element:
     71{{{
     72<fields>
     73.
     74.
     75
     76<field name="tablename" type="text_general" indexed="true" stored="true"/>
     77<field name="filetype" type="text_general" indexed="true" stored="true"/>
     78<field name="filename" type="text_general" indexed="true" stored="true"/>
     79
     80.
     81.
     82</fields>
     83}}}
     84  * Then, after the closing </fields> tag, add the following <copyField> entries:
     85{{{
     86
     87<copyField source="filetype" dest="text"/>
     88<copyField source="tablename" dest="text"/>
     89<copyField source="filename" dest="text"/>
     90
     91}}}
     92  * These are the dependencies and configuration changes required to install Solr and Sunburnt and integrate them with Eden.
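The manual edits to solrconfig.xml and schema.xml described above could also be scripted. The following is a minimal sketch, not the attached installation script: the function names `set_data_dir` and `add_search_fields` are hypothetical, and it assumes that Solr does not care where the `<copyField>` elements sit among the children of `<schema>`. Note that ElementTree drops XML comments when it rewrites a file, so keep a backup of the originals.

```python
import xml.etree.ElementTree as ET

# Field names taken from the schema changes listed above.
EXTRA_FIELDS = ("tablename", "filetype", "filename")


def set_data_dir(solrconfig_xml, data_dir):
    """Return the solrconfig.xml text with <dataDir> set to data_dir."""
    root = ET.fromstring(solrconfig_xml)
    node = root.find("dataDir")
    if node is None:
        # No <dataDir> yet: create one directly under the root element.
        node = ET.SubElement(root, "dataDir")
    node.text = data_dir
    return ET.tostring(root, encoding="unicode")


def add_search_fields(schema_xml):
    """Return the schema.xml text with the extra fields and copyFields."""
    root = ET.fromstring(schema_xml)
    fields = root.find("fields")
    if fields is None:
        raise ValueError("schema has no <fields> element")
    existing = {f.get("name") for f in fields.findall("field")}
    copied = {(c.get("source"), c.get("dest"))
              for c in root.findall("copyField")}
    for name in EXTRA_FIELDS:
        if name not in existing:
            ET.SubElement(fields, "field", name=name, type="text_general",
                          indexed="true", stored="true")
        if (name, "text") not in copied:
            # copyField entries are siblings of <fields>, not children.
            ET.SubElement(root, "copyField", source=name, dest="text")
    return ET.tostring(root, encoding="unicode")
```

Both functions take and return the file contents as text, so they can be tried on a copy of the config files before anything is overwritten. Running `add_search_fields` twice does not add duplicate entries.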
     93
     94
     95* Enabling text search:
     96-> Uncomment the following line in ''models/000_config.py''
     97{{{
     98# Uncomment this and set the solr url to connect to solr server for Full-Text Search
     99settings.base.solr_url = "http://127.0.0.1:8983/solr/"
     100}}}
     101Specify the appropriate IP address; in this example it is 127.0.0.1.[[BR]]
     102If Solr is running on a different machine, specify that machine's IP address instead.
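Before enabling the setting, it can help to confirm that the configured URL actually answers. The sketch below is an illustration only: `solr_is_reachable` is a hypothetical helper, not part of Eden or Sunburnt, and it simply assumes that a running Solr answers an HTTP GET on the base URL with status 200, as the browser check earlier suggests.

```python
import urllib.error
import urllib.request


def solr_is_reachable(solr_url, timeout=2.0):
    """Return True if an HTTP GET on solr_url answers with status 200."""
    try:
        with urllib.request.urlopen(solr_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, DNS failure, or a malformed URL.
        return False
```

For example, `solr_is_reachable("http://127.0.0.1:8983/solr/")` should return True only while the Solr server started with `java -jar start.jar` is running.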
     103
     104
     105* Asynchronously Indexing and Deleting Documents:
     106 * The code for asynchronously indexing documents is in ''models/tasks.py''
     107 * '''Insertion''': The code first inserts the document into the database. Then, in the ''onaccept'' callback, it indexes the document by calling the document_create_index() function from models/tasks.py. To enable Full Text Search for documents in any other module, add the code shown below to that module. For a working example, see the document_onaccept() hook in modules/s3db/doc.py.
     108 * '''Deletion''': The code first deletes the record from the database table, then selects the corresponding file and deletes its index from the Solr server. For the code, see the document_ondelete() hook in modules/s3db/doc.py.
     109 * In model()
     110{{{
     111
     112
     113        if settings.get_base_solr_url():
     114            onaccept = self.document_onaccept # where document_onaccept is the onaccept hook for indexing
     115            ondelete = self.document_ondelete # where document_ondelete is the ondelete hook for deleting the index
     116        else:
     117            onaccept = None
     118            ondelete = None
     119
     120        configure(tablename,
     121                  onaccept=onaccept,
     122                  ondelete=ondelete,
     123        .....
     124
     125
     126
     127}}}
     128 * In onaccept()
     129{{{
     130    @staticmethod
     131    def document_onaccept(form):
     132
     133        vars = form.vars
     134        doc = vars.file # where file is the name of the upload field
     135
     136        table = current.db.doc_document # doc_document is the tablename
     137        try:
     138            name = table.file.retrieve(doc)[0]
     139        except TypeError:
     140            name = doc
     141
     142        document = json.dumps(dict(filename=doc,
     143                                  name=name,
     144                                  id=vars.id,
     145                                  tablename="doc_document", # where "doc_document" is the name of the database table
     146                                  ))
     147
     148        current.s3task.async("document_create_index",
     149                             args = [document])
     150
     151        return
     152
     153
     154
     155}}}
     156 * In ondelete():
     157{{{
     158    @staticmethod
     159    def document_ondelete(row):
     160
     161        db = current.db
     162        table = db.doc_document # doc_document is the tablename
     163
     164        record = db(table.id == row.id).select(table.file, # where file is the name of the upload field
     165                                               limitby=(0, 1)).first()
     166
     167        document = json.dumps(dict(filename=record.file, # where file is the name of the upload field
     168                                  id=row.id,
     169                                 ))
     170
     171        current.s3task.async("document_delete_index",
     172                             args = [document])
     173
     174        return
     175
     176}}}
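Both hooks above hand their data to the background task as a JSON string. The sketch below shows that round trip in isolation. It is not the code from models/tasks.py: `build_index_payload` and `parse_index_payload` are hypothetical names, and the check for required keys is an assumption about what a task would reasonably need, based on the keys common to the two hooks.

```python
import json


def build_index_payload(record_id, stored_filename, original_name, tablename):
    """Serialise the fields the onaccept hook passes to the task."""
    return json.dumps({"filename": stored_filename,
                       "name": original_name,
                       "id": record_id,
                       "tablename": tablename})


def parse_index_payload(document):
    """Decode the payload on the task side and check the shared keys."""
    data = json.loads(document)
    # "filename" and "id" appear in both the index and the delete payload.
    missing = {"filename", "id"} - set(data)
    if missing:
        raise ValueError("payload is missing: %s" % ", ".join(sorted(missing)))
    return data
```

Passing a single string keeps the task arguments simple to queue, and the task can reject malformed payloads before it touches Solr.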
     177
     178=== Full Text Functionality ===
     179
     180* The full-text search functionality is integrated in modules/s3/s3resource.py; the fulltext() function does the work.
     181* The flow is as follows: the TEXT query first goes to the transform() function, which splits the query recursively and transforms it.
     182* After the transformation, the query with the TEXT operator goes to the fulltext() function, which searches for the keywords in the indexed documents.
     183* It retrieves the matching document ids and converts them into a BELONGS S3ResourceQuery.
     184* The code for fulltext() is in ''modules/s3/s3resource.py''
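The flow described above can be illustrated with a toy query tree. This is a simplified sketch and not the code in s3resource.py: the `Query` class is a hypothetical stand-in for S3ResourceQuery, the operator names are plain strings, and `search` stands for whatever looks up keywords in the Solr index.

```python
class Query(object):
    """Tiny stand-in for a query tree node (not S3ResourceQuery itself)."""

    def __init__(self, op, left, right=None):
        self.op = op
        self.left = left
        self.right = right

    def __repr__(self):
        return "Query(%r, %r, %r)" % (self.op, self.left, self.right)


def transform(query, search):
    """Recursively replace TEXT nodes with BELONGS nodes on the id field.

    search is a callable that maps a keyword string to document ids.
    """
    if not isinstance(query, Query):
        # Leaf: a field name or a literal value, nothing to transform.
        return query
    if query.op == "TEXT":
        # Look up the keyword, then express the hits as an id filter.
        ids = sorted(set(search(query.right)))
        return Query("BELONGS", "id", ids)
    return Query(query.op,
                 transform(query.left, search),
                 transform(query.right, search))
```

With a dictionary as a fake index, `transform(Query("TEXT", "file", "flood"), lambda kw: {"flood": [3, 1]}.get(kw, []))` yields a BELONGS node over the ids 1 and 3, while non-TEXT parts of a larger query pass through unchanged.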
     185
     186=== Unit Test ===
     187
     188* Unit tests are an important part of any code, as they check that it works as expected.
     189* For Full Text Search, I have implemented unit tests in ''modules/unit_tests/s3/s3resource.py'', in the class DocumentFullTextSearchTests.
     190* To run the unit tests, run the following from your web2py folder:
     191{{{python web2py.py -S eden -M -R applications/eden/modules/unit_tests/s3/s3resource.py}}}
     192* To check the code in different situations, separate tests cover the following cases:
     193 * When Solr is available (Solr server is running normally)
     194 * When Solr is unavailable (Solr server is not running, but enabled)
     195 * Checking the output of query() function
     196 * Checking the output of call() function
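The first two cases can be exercised without a real Solr server by passing in a stub search function. The sketch below is not the DocumentFullTextSearchTests class: `search_ids` and the test class are hypothetical, and the choice to return an empty list when the server is down is an assumption made only for this example, not a statement about how Eden behaves.

```python
import unittest


def search_ids(search, keyword):
    """Return matching ids, or an empty list if the backend is down.

    The empty-list fallback is an assumption made for this sketch.
    """
    try:
        return list(search(keyword))
    except ConnectionError:
        return []


class FullTextSketchTests(unittest.TestCase):

    def test_solr_available(self):
        # Stub backend that always finds two documents.
        self.assertEqual(search_ids(lambda kw: [1, 2], "flood"), [1, 2])

    def test_solr_unavailable(self):
        def down(keyword):
            raise ConnectionError("Solr is not running")
        self.assertEqual(search_ids(down, "flood"), [])


def run_sketch_tests():
    """Run the two sketch tests and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(FullTextSketchTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling `run_sketch_tests()` runs both cases and returns the result, so `run_sketch_tests().wasSuccessful()` reports whether they passed.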
     197