Changes between Version 74 and Version 75 of BluePrint/TextSearch


Timestamp:
08/16/13 20:24:24 (11 years ago)
Author:
Vishrut Mehta
  • BluePrint/TextSearch

    v74 v75  
    272 272 2) Unit tests for efficiency testing (these will also cover manual tests to check the efficiency of the search results)
    273 273
    274 
    275 === Description of Work Done ===
    276 
    277 * Pylucene
    278  * Installation details are here: http://eden.sahanafoundation.org/wiki/BluePrint/TextSearch#Pylucene
    279 * Apache Solr
    280  * The installation details are here:
    281   * [http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr distribution]
    282   * [http://wiki.apache.org/solr/SolrInstall Installation]
    283   * [http://lucene.apache.org/solr/4_3_0/tutorial.html Tutorial]
    284   * Prerequisites:
    285    * Java, at least 1.6
    286    * Unzip the distribution.
    287    * You can, for now, use the setup in the example directory, though this contains sample data that we won't want.
    288    * {{{cd example}}}
    289    * {{{java -jar start.jar}}}
    290    * Open in browser: http://localhost:8983/solr/
    291  
    292 * Sunburnt
    293  * The script attached below installs the dependencies, and also configures and installs Apache Solr and Sunburnt.[[BR]]
    294 
    295 * Solr Configuration and Schema changes
    296  * The attached installation script installs all the dependencies and applies the Solr configuration.
    297  * You can also install everything manually.
    298  * The dependencies you need to install are:
    299   * Antiword
    300 {{{
    301 wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
    302 tar xvzf antiword-0.37.tar.gz
    303 cd antiword-0.37
    304 make
    305 }}}
    306   * Pdfminer
    307 {{{
    308 wget http://pypi.python.org/packages/source/p/pdfminer/pdfminer-20110515.tar.gz
    309 tar xvzf pdfminer-20110515.tar.gz
    310 cd pdfminer-20110515
    311 python setup.py install
    312 }}}
    313   * Pyth
    314 {{{
    315 wget http://pypi.python.org/packages/source/p/pyth/pyth-0.5.6.tar.gz
    316 tar xvzf pyth-0.5.6.tar.gz
    317 cd pyth-0.5.6
    318 python setup.py install
    319 }}}
    320   * Httplib2
    321 {{{
    322 apt-get install python-httplib2
    323 }}}
    324   * xlrd
    325 {{{
    326 wget http://pypi.python.org/packages/source/x/xlrd/xlrd-0.9.2.tar.gz
    327 tar xvzf xlrd-0.9.2.tar.gz
    328 cd xlrd-0.9.2
    329 python setup.py install
    330 }}}
    331   * libxml2, lxml> 3.0
    332 {{{
    333 apt-get install python-pip
    334 apt-get install libxml2 libxslt-dev libxml2-dev
    335 pip install lxml==3.0.2
    336 }}}
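After a manual install, it can help to confirm that everything is actually visible to Python. The following is a minimal sketch, not part of the attached script: the import names are assumed to match the package names listed above, and antiword is assumed to have been placed on the PATH (the commands above only run `make`). It is written for Python 3, whereas the Eden code on this page targets Python 2.

```python
import importlib.util
import shutil

# Import names assumed for the Python packages listed above.
PYTHON_MODULES = ["pdfminer", "pyth", "httplib2", "xlrd", "lxml"]
# Command-line tools expected on the PATH (assumption: antiword was installed there).
BINARIES = ["antiword"]


def missing_dependencies(modules=PYTHON_MODULES, binaries=BINARIES):
    """Return the names of modules and binaries that cannot be found."""
    missing = []
    for name in modules:
        # find_spec locates a top-level module without importing it
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    for name in binaries:
        if shutil.which(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print("Missing:", missing_dependencies() or "nothing")
```

An empty list means every listed module and binary was found; it does not check version numbers such as the lxml requirement.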
    337  * After installing all these dependencies, change the Solr config file, solr-4.3.x/example/solr/collection1/conf/solrconfig.xml:
    338   * Set the directory path where you want to store the indexes. They can be stored anywhere, but it is better to keep them in the eden directory. For example, as in the installation script:
    339 {{{
    340 <dataDir>/home/<user>/web2py/applications/eden/indices</dataDir>
    341 }}}
    342  * You can change this path to suit your setup; for example, if Solr runs on another machine, the directory path will be different.
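The `<dataDir>` change can also be scripted instead of edited by hand. This is a rough sketch only: `set_data_dir` is a hypothetical helper, it assumes `<dataDir>` is a direct child of the root element of solrconfig.xml, and the sample XML is a stand-in rather than the real file. Note that `xml.etree.ElementTree` drops comments when it re-serialises, so editing by hand may be preferable if you want to keep them.

```python
import xml.etree.ElementTree as ET


def set_data_dir(solrconfig_xml, data_dir):
    """Return the solrconfig XML text with <dataDir> set to data_dir."""
    root = ET.fromstring(solrconfig_xml)
    node = root.find("dataDir")
    if node is None:
        # No <dataDir> element yet: create one under the root
        node = ET.SubElement(root, "dataDir")
    node.text = data_dir
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Stand-in for the contents of solrconfig.xml
    sample = "<config><dataDir>old/path</dataDir></config>"
    print(set_data_dir(sample, "/home/user/web2py/applications/eden/indices"))
```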
    343  * Next, make the following schema changes in the file solr-4.3.x/example/solr/collection1/conf/schema.xml:
    344   * Add the following in the <fields>..</fields> tag:
    345 {{{
    346 <fields>
    347 .
    348 .
    349 
    350 <field name="tablename" type="text_general" indexed="true" stored="true"/>
    351 <field name="filetype" type="text_general" indexed="true" stored="true"/>
    352 <field name="filename" type="text_general" indexed="true" stored="true"/>
    353 
    354 .
    355 .
    356 </fields>
    357 }}}
    358   * Then, after the closing </fields> tag, add the following <copyField> entries:
    359 {{{
    360 
    361 <copyField source="filetype" dest="text"/>
    362 <copyField source="tablename" dest="text"/>
    363 <copyField source="filename" dest="text"/>
    364 
    365 }}}
    366   * These are the configuration changes and dependencies required to install Solr and Sunburnt and integrate them with Eden.
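To double-check the schema edits, one option is to parse schema.xml and confirm that the three fields and their copyField entries are present. The sketch below is illustrative: `missing_schema_entries` is a hypothetical helper, and the sample schema is a trimmed stand-in for the real file.

```python
import xml.etree.ElementTree as ET

# Field names taken from the schema changes described above
REQUIRED = ("tablename", "filetype", "filename")


def missing_schema_entries(schema_xml, names=REQUIRED):
    """Report which <field> and <copyField dest="text"> entries are absent."""
    root = ET.fromstring(schema_xml)
    fields = {f.get("name") for f in root.iter("field")}
    copies = {c.get("source") for c in root.iter("copyField")
              if c.get("dest") == "text"}
    return {
        "field": [n for n in names if n not in fields],
        "copyField": [n for n in names if n not in copies],
    }


if __name__ == "__main__":
    # Stand-in schema with one field still missing its copyField
    sample = """
    <schema>
      <fields>
        <field name="tablename" type="text_general" indexed="true" stored="true"/>
        <field name="filetype" type="text_general" indexed="true" stored="true"/>
        <field name="filename" type="text_general" indexed="true" stored="true"/>
      </fields>
      <copyField source="filetype" dest="text"/>
      <copyField source="tablename" dest="text"/>
    </schema>
    """
    print(missing_schema_entries(sample))
```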
    367 
    368 
    369 * Enabling text search:
    370 -> Uncomment the following line in ''models/000_config.py''
    371 {{{
    372 # Uncomment this and set the solr url to connect to solr server for Full-Text Search
    373 settings.base.solr_url = "http://127.0.0.1:8983/solr/"
    374 }}}
    375 Specify the appropriate IP address; here it is 127.0.0.1.[[BR]]
    376 If Solr is running on a different machine, specify that machine's IP address instead.
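As a small illustration of the URL format, the sketch below builds a value of the same shape as the setting above and sanity-checks a hand-typed one. Both helper names are hypothetical and not part of Eden; the default port 8983 and the /solr/ path are taken from the example setting.

```python
from urllib.parse import urlparse


def solr_base_url(host="127.0.0.1", port=8983):
    """Build a URL of the form used for settings.base.solr_url."""
    return "http://%s:%d/solr/" % (host, port)


def check_solr_url(url):
    """Return url with a trailing slash, or raise ValueError if it lacks
    an http(s) scheme or a host name."""
    parts = urlparse(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        raise ValueError("not a usable Solr URL: %r" % url)
    return url if url.endswith("/") else url + "/"


if __name__ == "__main__":
    print(solr_base_url())                # local Solr, as in the example
    print(solr_base_url("192.168.1.20"))  # Solr on another machine (example IP)
```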
    377 
    378 
    379 * Asynchronously Indexing and Deleting Documents:
    380  * The code for asynchronously indexing documents is in ''models/tasks.py''
    381  * '''Insertion''': The document is first inserted into the database. Then the ''onaccept'' callback indexes it by calling the document_create_index() function from models/tasks.py. Add the code below to enable full-text search for documents in any module. For an example, see the document_onaccept() hook in modules/s3db/doc.py.
    382  * '''Deletion''': The record is first deleted from the database table; the file is then looked up and removed from Solr by deleting its index entry on the Solr server. For the code, see the document_ondelete() hook in modules/s3db/doc.py.
    383  * In model()
    384 {{{
    385 
    386 
    387         if settings.get_base_solr_url():
    388             onaccept = self.document_onaccept # where document_onaccept is the onaccept hook for indexing
    389             ondelete = self.document_ondelete # where document_ondelete is the ondelete hook for deleting the index
    390         else:
    391             onaccept = None
    392             ondelete = None
    393 
    394         configure(tablename,
    395                   onaccept=onaccept,
    396                   ondelete=ondelete,
    397         .....
    398 
    399 
    400 
    401 }}}
    402  * In onaccept()
    403 {{{
    404     @staticmethod
    405     def document_onaccept(form):
    406 
    407         vars = form.vars
    408         doc = vars.file # where file is the name of the upload field
    409 
    410         table = current.db.doc_document # doc_document is the tablename
    411         try:
    412             name = table.file.retrieve(doc)[0]
    413         except TypeError:
    414             name = doc
    415 
    416         document = json.dumps(dict(filename=doc,
    417                                   name=name,
    418                                   id=vars.id,
    419                                   tablename="doc_document", # where "doc_document" is the name of the database table
    420                                   ))
    421 
    422         current.s3task.async("document_create_index",
    423                              args = [document])
    424 
    425         return
    426 
    427 
    428 
    429 }}}
    430  * in ondelete():
    431 {{{
    432     @staticmethod
    433     def document_ondelete(row):
    434 
    435         db = current.db
    436         table = db.doc_document # doc_document is the tablename
    437 
    438         record = db(table.id == row.id).select(table.file, # where file is the name of the upload field
    439                                                limitby=(0, 1)).first()
    440 
    441         document = json.dumps(dict(filename=record.file, # where file is the name of the upload field
    442                                   id=row.id,
    443                                  ))
    444 
    445         current.s3task.async("document_delete_index",
    446                              args = [document])
    447 
    448         return
    449 
    450 }}}
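Both hooks hand the task a single JSON string, so the task side has to decode it before touching Solr. The sketch below illustrates that round trip in isolation. It is not the code from models/tasks.py: `build_index_payload` and `handle_create_index` are hypothetical names, and `index_func` stands in for whatever actually talks to Solr. It is written for Python 3.

```python
import json


def build_index_payload(filename, name, record_id, tablename="doc_document"):
    """Serialise the fields that the onaccept hook passes to the task."""
    return json.dumps(dict(filename=filename,
                           name=name,
                           id=record_id,
                           tablename=tablename))


def handle_create_index(document, index_func):
    """Hypothetical task side: decode the payload and hand it to index_func."""
    data = json.loads(document)
    index_func(id=data["id"],
               filename=data["filename"],
               name=data["name"],
               tablename=data["tablename"])
    return data["id"]


if __name__ == "__main__":
    indexed = []
    payload = build_index_payload("doc_document.file.abc.pdf", "report.pdf", 7)
    # Collect the decoded fields instead of contacting a Solr server
    handle_create_index(payload, lambda **fields: indexed.append(fields))
    print(indexed)
```

Passing a JSON string rather than a record object keeps the task arguments serialisable, which matters when the task runs in a separate worker process.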
    451 274
    452 275 === Full Text Functionality ===