Changes between Version 40 and Version 41 of Event/2013/GSoC/TextSearch


Timestamp: 08/17/13 12:28:47
Author: Vishrut Mehta
 * https://trello.com/b/LkHQycbZ/gsoc-13-search-project
=== Description of Work Done ===

 * Pylucene
  * Installation details are here: http://eden.sahanafoundation.org/wiki/BluePrint/TextSearch#Pylucene
 * Apache Solr
  * The installation details are here:
   * [http://lucene.apache.org/solr/mirrors-solr-latest-redir.html Solr distribution]
   * [http://wiki.apache.org/solr/SolrInstall Installation]
   * [http://lucene.apache.org/solr/4_3_0/tutorial.html Tutorial]
  * Prerequisites:
   * Java, at least 1.6
   * Unzip the distribution.
   * For now, you can use the setup in the example directory, although it contains sample data that we won't want.
   * {{{cd example}}}
   * {{{java -jar start.jar}}}
   * Open http://localhost:8983/solr/ in a browser.
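Once {{{start.jar}}} is running, you can sanity-check the server from Python before wiring anything into Eden. This is just a sketch: {{{/admin/ping}}} is the ping handler shipped with the stock Solr example config, and the helper names here are made up for illustration.

```python
import urllib.request

def solr_ping_url(base_url):
    """Build the URL of Solr's ping handler from the server's base URL."""
    return base_url.rstrip("/") + "/admin/ping"

def solr_is_up(base_url, timeout=5):
    """Return True if the Solr server answers its ping handler."""
    try:
        with urllib.request.urlopen(solr_ping_url(base_url),
                                    timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, DNS failure, timeout, ...
        return False
```

For example, {{{solr_is_up("http://localhost:8983/solr")}}} should return True once the example server has started.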
 * Sunburnt
  * The script attached below installs the dependencies and also configures and installs Apache Solr and Sunburnt.[[BR]]

 * Solr Configuration and Schema changes
  * I have attached an installation script that installs all the dependencies and the Solr configuration.
  * You can also install manually.
  * You will need to install the following dependencies:
   * Antiword
{{{
wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
tar xvzf antiword-0.37.tar.gz
cd antiword-0.37
make
}}}
   * Pdfminer
{{{
wget http://pypi.python.org/packages/source/p/pdfminer/pdfminer-20110515.tar.gz
tar xvzf pdfminer-20110515.tar.gz
cd pdfminer-20110515
python setup.py install
}}}
   * Pyth
{{{
wget http://pypi.python.org/packages/source/p/pyth/pyth-0.5.6.tar.gz
tar xvzf pyth-0.5.6.tar.gz
cd pyth-0.5.6
python setup.py install
}}}
   * Httplib2
{{{
apt-get install python-httplib2
}}}
   * xlrd
{{{
wget http://pypi.python.org/packages/source/x/xlrd/xlrd-0.9.2.tar.gz
tar xvzf xlrd-0.9.2.tar.gz
cd xlrd-0.9.2
python setup.py install
}}}
   * libxml2, lxml > 3.0
{{{
apt-get install python-pip
apt-get install libxml2 libxslt-dev libxml2-dev
pip install lxml==3.0.2
}}}
  * After installing these dependencies, we need to change the Solr config file, solr-4.3.x/example/solr/collection1/conf/solrconfig.xml:
   * Change the directory path to wherever you want to store the indexes. This can be anywhere, but it is best kept in the eden directory. For example, as in the installation script:
{{{
<dataDir>/home/<user>/web2py/applications/eden/indices</dataDir>
}}}
  * You can change this path to suit your setup; for example, if Solr runs on another machine, the directory path will differ.
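If you script this step (as the attached installer does), the dataDir value can be derived from the web2py root. A minimal sketch, assuming the layout shown above; the function names and the "eden"/"indices" defaults are illustrative, not part of Eden:

```python
import posixpath

def solr_data_dir(web2py_root, app="eden"):
    """Path where Solr should keep its index files, inside the given app."""
    return posixpath.join(web2py_root, "applications", app, "indices")

def data_dir_element(path):
    """Render the <dataDir> element to paste into solrconfig.xml."""
    return "<dataDir>%s</dataDir>" % path
```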
  * Now for the schema changes, in solr-4.3.x/example/solr/collection1/conf/schema.xml:
   * Add the following inside the <fields>..</fields> tag:
{{{
<fields>
.
.

<field name="tablename" type="text_general" indexed="true" stored="true"/>
<field name="filetype" type="text_general" indexed="true" stored="true"/>
<field name="filename" type="text_general" indexed="true" stored="true"/>

.
.
</fields>
}}}
   * Then, after the <fields>..</fields> tag, add the following <copyField> declarations:
{{{
<copyField source="filetype" dest="text"/>
<copyField source="tablename" dest="text"/>
<copyField source="filename" dest="text"/>
}}}
   * These are the configuration changes and dependencies required for a successful Solr and Sunburnt installation and integration in Eden.
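With the schema in place, the indexing side can be exercised from a Python shell. The helper below only builds a document dict matching the three custom fields (plus the unique "id" field that Solr's stock example schema already defines); the sample values are invented for illustration:

```python
def make_solr_doc(doc_id, tablename, filename, filetype):
    """Build a document dict matching the schema fields added above.

    "id" comes from the stock example schema; tablename, filename
    and filetype are the custom fields from schema.xml.
    """
    return {"id": str(doc_id),
            "tablename": tablename,
            "filename": filename,
            "filetype": filetype}
```

To index such a document via Sunburnt, connect with {{{si = sunburnt.SolrInterface("http://localhost:8983/solr/")}}}, then call {{{si.add(make_solr_doc(...))}}} followed by {{{si.commit()}}}.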

 * Enabling text search:
  * Uncomment the following lines in ''models/000_config.py'':
{{{
# Uncomment this and set the solr url to connect to solr server for Full-Text Search
settings.base.solr_url = "http://127.0.0.1:8983/solr/"
}}}
  * Specify the appropriate IP address; here it is 127.0.0.1.[[BR]]If the Solr server runs on a different machine, specify that machine's IP instead.
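Code that needs the host or port out of this setting (for example, to check connectivity to a remote Solr machine) can take the URL apart with the standard library. A small sketch; {{{solr_host_port}}} is not part of Eden:

```python
from urllib.parse import urlsplit

def solr_host_port(solr_url):
    """Extract (host, port) from a settings.base.solr_url value,
    defaulting to port 80 when the URL does not name one."""
    parts = urlsplit(solr_url)
    return parts.hostname, parts.port or 80
```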

 * Asynchronous Indexing and Deletion of Documents:
  * The code for asynchronously indexing documents is in ''models/tasks.py''.
  * '''Insertion''': The code first inserts the document into the database. Then, in the ''onaccept'' callback, it indexes the document by calling the document_create_index() function from models/tasks.py. The following code should be added to enable full-text search for documents in any module; for an example, see the document_onaccept() hook in modules/s3db/doc.py.
  * '''Deletion''': The code first deletes the record from the database table, then deletes the file from Solr as well, by removing its index stored on the Solr server. See the document_ondelete() hook in modules/s3db/doc.py.
  * In model():
{{{
        if settings.get_base_solr_url():
            onaccept = self.document_onaccept # document_onaccept is the hook for indexing
            ondelete = self.document_ondelete # document_ondelete is the hook for deleting
        else:
            onaccept = None
            ondelete = None

        configure(tablename,
                  onaccept=onaccept,
                  ondelete=ondelete,
        .....
}}}
  * In onaccept():
{{{
    @staticmethod
    def document_onaccept(form):

        vars = form.vars
        doc = vars.file # file is the name of the upload field

        table = current.db.doc_document # doc_document is the tablename
        try:
            name = table.file.retrieve(doc)[0]
        except TypeError:
            name = doc

        document = json.dumps(dict(filename=doc,
                                   name=name,
                                   id=vars.id,
                                   tablename="doc_document", # the name of the database table
                                   ))

        current.s3task.async("document_create_index",
                             args = [document])

        return
}}}
  * In ondelete():
{{{
    @staticmethod
    def document_ondelete(row):

        db = current.db
        table = db.doc_document # doc_document is the tablename

        record = db(table.id == row.id).select(table.file, # file is the name of the upload field
                                               limitby=(0, 1)).first()

        document = json.dumps(dict(filename=record.file, # file is the name of the upload field
                                   id=row.id,
                                   ))

        current.s3task.async("document_delete_index",
                             args = [document])

        return
}}}

=== Full Text Functionality ===

 * The full-text search functionality is integrated in modules/s3/s3resource.py; the fulltext() method does the work.
 * The flow is: first the TEXT query goes to the transform() function, which splits the query recursively and transforms it.
 * After transformation, the query with the TEXT operator goes to the fulltext() function, which searches for the keywords in the indexed documents.
 * It retrieves the matching document ids and then converts them into a BELONGS S3ResourceQuery.
 * The code for fulltext() is in ''modules/s3/s3resource.py''.
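The transform-then-search flow can be sketched independently of S3ResourceQuery. Here a query is a nested tuple tree, {{{search}}} stands in for the Solr keyword lookup, and every TEXT leaf is rewritten into a BELONGS leaf over the matched ids; the names and the tuple representation are illustrative, not the actual s3resource API:

```python
def transform(query, search):
    """Recursively replace ("TEXT", keyword) leaves with ("BELONGS", ids)
    leaves, where the ids come from the full-text search backend."""
    op = query[0]
    if op == "TEXT":
        return ("BELONGS", tuple(search(query[1])))
    if op in ("AND", "OR", "NOT"):
        return (op,) + tuple(transform(q, search) for q in query[1:])
    return query  # other operators pass through unchanged

# A fake full-text backend mapping keywords to document ids:
fake_index = {"flood": [1, 3], "shelter": [3, 7]}
result = transform(("AND", ("TEXT", "flood"), ("TEXT", "shelter")),
                   lambda kw: fake_index.get(kw, []))
# result == ("AND", ("BELONGS", (1, 3)), ("BELONGS", (3, 7)))
```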
=== Unit Test ===

 * Unit tests are an important part of any code, checking that it works as expected.
 * For full-text search, the tests are implemented in ''modules/unit_tests/s3/s3resource.py'', in the class DocumentFullTextSearchTests.
 * To run the unit tests, type in your web2py folder:
{{{python web2py.py -S eden -M -R applications/eden/modules/unit_tests/s3/s3resource.py}}}
 * To exercise the code in different situations, several test cases are implemented:
  * When Solr is available (the Solr server is running normally)
  * When Solr is unavailable (Solr is enabled, but the server is not running)
  * Checking the output of the query() function
  * Checking the output of the call() function
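Outside of the Eden test runner, the shape of such tests looks roughly like this; the stub {{{fulltext}}} function and its empty-result behaviour when the server is down are invented for illustration (the real tests live in modules/unit_tests/s3/s3resource.py):

```python
import unittest

def fulltext(keyword, solr=None):
    """Stand-in for the real search: return matching ids, or an empty
    list when the Solr connection is unavailable."""
    if solr is None:              # Solr enabled, but server not running
        return []
    return solr.get(keyword, [])

class FullTextSearchStubTests(unittest.TestCase):

    def test_solr_available(self):
        self.assertEqual(fulltext("flood", {"flood": [1, 3]}), [1, 3])

    def test_solr_unavailable(self):
        self.assertEqual(fulltext("flood", None), [])

# Run with: python -m unittest <module name>
```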

[[BR]][[BR]]

||= SMART Goal =||= Measure =||= Status =||