
# cs510-project

## Tutorial

Before running the software, here are a few points about the data files:

  1. The dataset (academic papers) needs to be placed in a folder `papers_to_index`.

  2. Each document is expected to be an XML file with the following fields:

  3. Some other files are expected to be present before the code can run successfully:

     - `docs.json`: a JSON file containing all the documents. This file can be used when the raw training data is available as JSON. Each line represents one document and contains, at the least, `keyPhrases`, `paperAbstract`, `title` and `introduction`. These fields are used to generate features for training our neural network.

     - `train_queries.json`: each line is of the form `{"qid": "the query id", "query": "the query string", "ana": {the annotated entity id and frequency}}`. Used for training when the raw data is JSON.

     - `train_queries_qrel`: relevance judgements for the training queries.

     - `trainqueries.xml`: use this file when the raw training data is in XML format.

     The above four files are from Freebase and are similar to the files used in the search engine assignment.

  4. `supervisedTrain.txt`: the file generated when the neural net is trained on the training data. Contains the feature values.
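As a sanity check on the training-query file, each line of `train_queries.json` can be parsed independently with the standard `json` module. The sketch below assumes only the line format described above (`qid`, `query`, `ana`); the sample values and the `parse_query_line` helper are hypothetical, for illustration only.

```python
import json

# Hypothetical sample line following the train_queries.json format above,
# where "ana" maps annotated entity ids to their frequency in the query.
sample_line = '{"qid": "q1", "query": "neural ranking models", "ana": {"/m/0abc12": 2}}'

def parse_query_line(line):
    """Parse one line of train_queries.json into (qid, query, ana)."""
    record = json.loads(line)
    return record["qid"], record["query"], record["ana"]

qid, query, ana = parse_query_line(sample_line)
print(qid, query, ana)
```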

## Steps

  1. Install all necessary modules by running `pip3 install -r requirements.txt`.
  2. To build the index for the dataset, run `python3 IndexBuilder.py`. The index is built from the documents present in `papers_to_index` in the current directory and is created in a folder `index`. Create an empty folder `index` if the code doesn't create it automatically.
  3. Once the index is built (this might take a while), the next step is to generate the features for training the neural network. Ensure that the necessary files are in place; they are similar to the files provided for the search engine assignment and are named `docs.json`, `train_queries.json` and `train_queries_qrel`. The features will be written to a file `supervisedTrain.txt` when one of the following commands is executed: `python3 train_academicdata.py` for JSON data, or `python3 train_generaldata.py` for XML data. Note that `train_generaldata.py` is trained on a different set of features and will have to be updated accordingly.
  4. Execute `python3 neuralnetregressor.py` to train the model on the generated features; the model is saved to a file `plsaNeuralNetModel.model`.
  5. After the neural net is trained, the software is ready to run. Execute `python3 SearchEngine.py`, which hosts the server on localhost.
  6. The console shows the link where the server started. Browse to that URL to open the UI, or open the `index.html` file in a browser; it is configured to connect to localhost.
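Before running the steps above, it can help to verify the layout they assume: the `papers_to_index` folder, the three training files, and the `index` folder from step 2. The `check_setup` helper below is a minimal sketch, not part of the repository; it only checks the names mentioned in this README.

```python
import os

# Training files named in step 3 of this README.
REQUIRED_FILES = ["docs.json", "train_queries.json", "train_queries_qrel"]

def check_setup(base="."):
    """Return a list of missing items; also create 'index' if absent (step 2)."""
    missing = []
    if not os.path.isdir(os.path.join(base, "papers_to_index")):
        missing.append("papers_to_index/")
    for name in REQUIRED_FILES:
        if not os.path.isfile(os.path.join(base, name)):
            missing.append(name)
    # Step 2: create an empty 'index' folder if the code hasn't made it.
    os.makedirs(os.path.join(base, "index"), exist_ok=True)
    return missing

print(check_setup())
```

An empty list means the pipeline's inputs are in place.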