Changes between Version 73 and Version 74 of BluePrint/TextSearch


Timestamp:
08/16/13 20:18:36 (11 years ago)
Author:
Vishrut Mehta
Comment:

--

  • BluePrint/TextSearch

    v73 v74  
    2712711) Selenium tests for correctness of the code and the Eden framework [[BR]]
    2722722) Unit tests for efficiency (these will also cover manual tests to check the efficiency of the search results)
     273
     274
     275=== Description of Work Done ===
     276
     277* Pylucene
     278 * Installation details are here: http://eden.sahanafoundation.org/wiki/BluePrint/TextSearch#Pylucene
     279* Apache Solr
     280 * The installation details are here:
     281  * [http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr distribution]
     282  * [http://wiki.apache.org/solr/SolrInstall Installation]
     283  * [http://lucene.apache.org/solr/4_3_0/tutorial.html Tutorial]
     284  * Prerequisites:
     285   * Java, at least 1.6
     286   * Unzip the distribution.
     287   * You can, for now, use the setup in the example directory, though this contains sample data that we won't want.
     288   * {{{cd example}}}
     289   * {{{java -jar start.jar}}}
     290   * Open in browser: http://localhost:8983/solr/
     291 
     292* Sunburnt
     293 * The script attached below installs the dependencies, and also installs and configures Apache Solr and Sunburnt.[[BR]]
     294
     295* Solr Configuration and Schema changes
     296 * I have attached an installation script that installs all the dependencies and sets up the Solr configuration.
     297 * You can also install everything manually.
     298 * The dependencies you need to install are:
     299  * Antiword
     300{{{
     301wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
     302tar xvzf antiword-0.37.tar.gz
     303cd antiword-0.37
     304make
     305}}}
     306  * Pdfminer
     307{{{
     308wget http://pypi.python.org/packages/source/p/pdfminer/pdfminer-20110515.tar.gz
     309tar xvzf pdfminer-20110515.tar.gz
     310cd pdfminer-20110515
     311python setup.py install
     312}}}
     313  * Pyth
     314{{{
     315wget http://pypi.python.org/packages/source/p/pyth/pyth-0.5.6.tar.gz
     316tar xvzf pyth-0.5.6.tar.gz
     317cd pyth-0.5.6
     318python setup.py install
     319}}}
     320  * Httplib2
     321{{{
     322apt-get install python-httplib2
     323}}}
     324  * xlrd
     325{{{
     326wget http://pypi.python.org/packages/source/x/xlrd/xlrd-0.9.2.tar.gz
     327tar xvzf xlrd-0.9.2.tar.gz
     328cd xlrd-0.9.2
     329python setup.py install
     330}}}
     331  * libxml2, lxml > 3.0
     332{{{
     333apt-get install python-pip
     334apt-get install libxml2 libxslt-dev libxml2-dev
     335pip install lxml==3.0.2
     336}}}
     337 * After installing all these dependencies, we need to change the Solr config file, solr-4.3.x/example/solr/collection1/conf/solrconfig.xml
     338  * Change the directory path to where you want to store all the indexes. They can be stored anywhere, but it is better to keep them in the eden directory. For example, as in the installation script:
     339{{{
     340<dataDir>/home/<user>/web2py/applications/eden/indices</dataDir>
     341}}}
     342 * You can change this path to suit your setup; for example, if Solr runs on another machine, the directory path will be different.
     343 * Now, we will discuss the schema changes in file solr-4.3.x/example/solr/collection1/conf/schema.xml -
     344  * Add the following in the <fields>..</fields> tag:
     345{{{
     346<fields>
     347.
     348.
     349
     350<field name="tablename" type="text_general" indexed="true" stored="true"/>
     351<field name="filetype" type="text_general" indexed="true" stored="true"/>
     352<field name="filename" type="text_general" indexed="true" stored="true"/>
     353
     354.
     355.
     356</fields>
     357}}}
     358  * Then, after the closing </fields> tag, add the following <copyField> entries:
     359{{{
     360
     361<copyField source="filetype" dest="text"/>
     362<copyField source="tablename" dest="text"/>
     363<copyField source="filename" dest="text"/>
     364
     365}}}
     366  * These are the configuration changes and dependencies required to install Solr and Sunburnt and integrate them with Eden.
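
As a quick sanity check on the schema edits above, the following sketch parses a schema.xml-style document and reports which of the three fields and copyField rules are still missing. It is only an illustration: the helper name `missing_schema_entries` and the inline `SAMPLE` XML are assumptions made for this example, and a real schema.xml contains many more entries.

```python
import xml.etree.ElementTree as ET

# Field names taken from the schema changes described above
REQUIRED = ("tablename", "filetype", "filename")


def missing_schema_entries(schema_xml, required=REQUIRED):
    """Return (missing_fields, missing_copyfields) for a schema.xml string.

    Hypothetical helper, not part of Eden or Solr.
    """
    root = ET.fromstring(schema_xml)
    # Collect the names of all declared <field> elements
    fields = {f.get("name") for f in root.iter("field")}
    # Only copyField rules that feed the catch-all "text" field count here
    copied = {c.get("source") for c in root.iter("copyField")
              if c.get("dest") == "text"}
    missing_fields = [n for n in required if n not in fields]
    missing_copies = [n for n in required if n not in copied]
    return missing_fields, missing_copies


# Deliberately incomplete sample, so that the check has something to report
SAMPLE = """
<schema name="example" version="1.5">
  <fields>
    <field name="tablename" type="text_general" indexed="true" stored="true"/>
    <field name="filetype" type="text_general" indexed="true" stored="true"/>
  </fields>
  <copyField source="tablename" dest="text"/>
</schema>
"""
```

Calling `missing_schema_entries(SAMPLE)` reports `filename` as a missing field, and `filetype` and `filename` as missing copyField sources. To check a real installation, the contents of schema.xml could be read from disk and passed in instead of `SAMPLE`.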
     367
     368
     369* Enabling text search:
     370-> Uncomment the following line in ''models/000_config.py''
     371{{{
     372# Uncomment this and set the solr url to connect to solr server for Full-Text Search
     373settings.base.solr_url = "http://127.0.0.1:8983/solr/"
     374}}}
     375Specify the appropriate IP address; here it is 127.0.0.1.[[BR]]
     376If Solr is running on a different machine, specify that machine's IP address instead.
     377
     378
     379* Asynchronously Indexing and Deleting Documents:
     380 * The code for asynchronously indexing documents is in ''models/tasks.py''
     381 * '''Insertion''': The code first inserts the document into the database. Then, in the ''onaccept'' callback, it indexes the document by calling the document_create_index() function from models/tasks.py. Add the code below to enable Full Text search for documents in any module. For an example, see the document_onaccept() hook in modules/s3db/doc.py.
     382 * '''Deletion''': The code first deletes the record from the database table, then selects the file and deletes it from Solr as well, by deleting its index stored on the Solr server. For the code, see the document_ondelete() hook in modules/s3db/doc.py.
     383 * In model()
     384{{{
     385
     386
     387        if settings.get_base_solr_url():
     388            onaccept = self.document_onaccept # where document_onaccept is the onaccept hook for indexing
     389            ondelete = self.document_ondelete # where document_ondelete is the ondelete hook for deleting
     390        else:
     391            onaccept = None
     392            ondelete = None
     393
     394        configure(tablename,
     395                  onaccept=onaccept,
     396                  ondelete=ondelete,
     397        .....
     398
     399
     400
     401}}}
     402 * In onaccept()
     403{{{
     404    @staticmethod
     405    def document_onaccept(form):
     406
     407        vars = form.vars
     408        doc = vars.file # where file is the name of the upload field
     409
     410        table = current.db.doc_document # doc_document is the tablename
     411        try:
     412            name = table.file.retrieve(doc)[0]
     413        except TypeError:
     414            name = doc
     415
     416        document = json.dumps(dict(filename=doc,
     417                                  name=name,
     418                                  id=vars.id,
     419                                  tablename="doc_document", # where "doc_document" is the name of the database table
     420                                  ))
     421
     422        current.s3task.async("document_create_index",
     423                             args = [document])
     424
     425        return
     426
     427
     428
     429}}}
     430 * In ondelete():
     431{{{
     432    @staticmethod
     433    def document_ondelete(row):
     434
     435        db = current.db
     436        table = db.doc_document # doc_document is the tablename
     437
     438        record = db(table.id == row.id).select(table.file, # where file is the name of the upload field
     439                                               limitby=(0, 1)).first()
     440
     441        document = json.dumps(dict(filename=record.file, # where file is the name of the upload field
     442                                  id=row.id,
     443                                 ))
     444
     445        current.s3task.async("document_delete_index",
     446                             args = [document])
     447
     448        return
     449
     450}}}
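
Both hooks above serialise a small dict to a JSON string before handing it to the asynchronous task, which keeps the task arguments plain and easy to store. The sketch below shows that round trip in isolation. The function names `build_index_payload` and `read_index_payload` are hypothetical and are not taken from models/tasks.py; the real document_create_index() and document_delete_index() tasks may decode and validate their argument differently.

```python
import json


def build_index_payload(filename, record_id, name=None, tablename=None):
    """Serialise the values a hook passes to the async task (hypothetical helper)."""
    payload = dict(filename=filename, id=record_id)
    # name and tablename are only sent by the onaccept hook shown above
    if name is not None:
        payload["name"] = name
    if tablename is not None:
        payload["tablename"] = tablename
    return json.dumps(payload)


def read_index_payload(document):
    """Decode the JSON string on the task side and check the keys both hooks send."""
    data = json.loads(document)
    missing = [key for key in ("filename", "id") if key not in data]
    if missing:
        raise ValueError("payload is missing keys: %s" % ", ".join(missing))
    return data
```

For example, `read_index_payload(build_index_payload("doc_document.file.abc.pdf", 12, name="report.pdf", tablename="doc_document"))` gives back a dict with the same four values, where the file name used here is made up for illustration.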
     451
     452=== Full Text Functionality ===
     453
     454* The full text search functionality is integrated in modules/s3/s3resource.py; the fulltext() function does the work.
     455* The flow is: first the TEXT query goes to the transform() function, which splits the query recursively and then transforms it.
     456* After the transformation, the query with the TEXT operator goes to the fulltext() function, which searches for the keywords in the indexed documents.
     457* It retrieves the matching document ids and then converts them into a BELONGS S3ResourceQuery.
     458* The code for fulltext() is in ''modules/s3/s3resource.py''
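
The flow described above can be sketched with a toy query tree. This is not the code from s3resource.py: the `Query` tuple stands in for S3ResourceQuery, `FAKE_INDEX` replaces the Solr server with an in-memory dict, and combining the keywords as a union of matches is an assumption made for this example.

```python
from collections import namedtuple

# Toy stand-in for S3ResourceQuery; op is "AND", "OR", "TEXT" or "BELONGS"
Query = namedtuple("Query", ["op", "left", "right"])

# Hypothetical in-memory index (keyword -> document ids) replacing Solr
FAKE_INDEX = {"flood": {1, 4}, "report": {4, 7}}


def fulltext(query, index=FAKE_INDEX):
    """Turn a TEXT query into a BELONGS query over the matching document ids."""
    ids = set()
    for word in query.right.lower().split():
        # Assumption: a document matches if it contains any of the keywords
        ids |= index.get(word, set())
    return Query("BELONGS", query.left, sorted(ids))


def transform(query, index=FAKE_INDEX):
    """Recursively split compound queries and rewrite every TEXT leaf."""
    if query.op in ("AND", "OR"):
        return Query(query.op,
                     transform(query.left, index),
                     transform(query.right, index))
    if query.op == "TEXT":
        return fulltext(query, index)
    # Any other operator is left untouched
    return query
```

With this sketch, `transform(Query("AND", Query("TEXT", "id", "flood report"), Query("BELONGS", "id", [1, 2])))` rewrites only the TEXT leaf, turning it into `Query("BELONGS", "id", [1, 4, 7])`, and leaves the other branch as it was.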
     459
     460=== Unit Test ===
     461
     462* Unit tests are an important part of any code, to check that it works as expected.
     463* For Full Text Search, I have implemented them in ''modules/unit_tests/s3/s3resource.py'' - class DocumentFullTextSearchTests
     464* To run the unit tests, type in your web2py folder:
     465{{{python web2py.py -S eden -M -R applications/eden/modules/unit_tests/s3/s3resource.py}}}
     466* Different tests cover the code in different situations, so that errors are caught. The cases are:
     467 * When Solr is available (Solr server is running normally)
     468 * When Solr is unavailable (Solr server is not running, but enabled)
     469 * Checking the output of query() function
     470 * Checking the output of call() function
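
The first two cases in the list above (Solr available and Solr unavailable) can be exercised without a running server by substituting a stub backend. The sketch below only illustrates that pattern. `SolrDown`, `search_ids` and the choice to return None when the server cannot be reached are all assumptions made for this example, and they do not describe what DocumentFullTextSearchTests actually asserts.

```python
import unittest


class SolrDown(Exception):
    """Hypothetical error raised when the Solr server cannot be reached."""


def search_ids(keywords, backend):
    """Return matching ids, or None if the backend is unreachable (assumed fallback)."""
    try:
        return backend(keywords)
    except SolrDown:
        return None


class FullTextSketchTests(unittest.TestCase):

    def test_solr_available(self):
        # Stub backend that behaves like a running server with two hits
        backend = lambda keywords: [3, 5]
        self.assertEqual(search_ids("flood", backend), [3, 5])

    def test_solr_unavailable(self):
        # Stub backend that behaves like an enabled but stopped server
        def backend(keywords):
            raise SolrDown("connection refused")
        self.assertIsNone(search_ids("flood", backend))
```

Saved to a file, the sketch can be run with `python -m unittest <module name>`; no Eden or Solr installation is needed, since both cases use stubs.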
     471
     472
    273473
    274474== References ==