This post is the second part of a multi-part series on how to build a search engine –
- How to build a search engine – Part 1: Installing the tools and getting the technology stack ready
- How to build a search engine – Part 2: Configuring elasticsearch
- How to build a search engine – Part 3: Indexing elasticsearch
- How to build a search engine – Part 4: Building the front end
Here is a sneak peek at what the final output is going to look like –
In case you are in a hurry, you can find the full code for the project on my GitHub page.
In this post we will focus on configuring the elasticsearch bit.
I have chosen the Wikipedia people dump as the dataset. It contains the wiki pages of a subset of the people on Wikipedia and consists of three columns – URI, name, text. As the column names suggest, URI is the actual link to that person’s wiki page, name is the person’s name, and text is the field into which the entire content of that person’s page has been dumped. The text column is unstructured, as it contains free text.
We will mainly focus on building a search engine that searches through the text column and displays the corresponding name and URI.
If you are already sitting on some other dataset that you want to build the search engine on, then everything that we are going to do still applies; you just need to change the column names to make it work!
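If you want to take a quick look at the data before wiring it into elasticsearch, here is a minimal sketch using pandas. The file name people_wiki.csv and the rename mapping are assumptions – adjust them to match your own copy of the dump.
import pandas as pd

# Assumed file name for the Wikipedia people dump; change it to match your copy.
df = pd.read_csv('people_wiki.csv')

# The three columns this series relies on: URI, name, text.
print(df.columns.tolist())
print(df[['URI', 'name', 'text']].head())

# If your dataset uses different column names, rename them once here so the
# rest of the series works unchanged (the mapping below is just an example).
# df = df.rename(columns={'url': 'URI', 'title': 'name', 'body': 'text'})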
So let’s get started. The first thing that we need to do is configure elasticsearch so that we can load some data into it.
If you are going to move forward with the Wikipedia data, just save the below script in a .py file and give it a run. Caution: Elasticsearch must be up and running on your system (find details in Part 1).
import requests

settings = '''
{
"settings" : {
"analysis" : {
"filter": {
"trigrams_filter": {
"type": "ngram",
"min_gram": 5,
"max_gram": 8
}
},
"analyzer" : {
"stem" : {
"tokenizer" : "standard",
"filter" : ["standard", "lowercase", "stop", "porter_stem","trigrams_filter"]
},
"my_ngram_analyzer" : {
"tokenizer" : "my_ngram_tokenizer"
}
},
"tokenizer" : {
"my_ngram_tokenizer" : {
"type" : "nGram",
"min_gram" : "4",
"max_gram" : "8"
}
}
}
},
"mappings" : {
"index_type_1" : {
"dynamic" : true,
"properties" : {
"text" : {
"type" : "string",
"analyzer" : "stem"
} ,
"name" : {
"type" : "string",
"analyzer" : "simple"
}
}
},
"index_type_suggest" : {
"properties" : {
"search_query" : {
"type" : "completion"
}
}
}
}
}
'''
url = 'http://localhost:9200/wiki_search'

# Delete the index if it already exists, then create it with the settings above.
resp_del = requests.delete(url)
print(resp_del)

resp = requests.put(url, data=settings)
print(resp)
Let’s walk through some of the code. settings is a multi-line string in Python that holds the JSON describing everything we want elasticsearch to do to our data.
The n-gram filter is for subset pattern matching: it breaks each token into overlapping character sequences of 5 to 8 characters. This means that if I search for “start”, it will match the word “restart”, because “start” is one of the 5-character grams of “restart”.
Before indexing, we want to make sure the data goes through some pre-processing. This is defined in the line "filter" : ["standard", "lowercase", "stop", "porter_stem", "trigrams_filter"]. The filters are as follows (a quick way to test the whole chain is sketched after this list) –
- Standard – standard word tokenization based on word boundaries. Basically, it means splitting a sentence into individual words.
- Lowercase – converting all text to lowercase so that the data becomes more standardized.
- Stop – removing stopwords. We humans have a habit of writing verbose sentences, and many of those words add little meaning when an algorithm processes the text. These words are called stopwords. For example – I sort of really did not want to go there. Words like “sort of” and “really” are rough examples of stopwords; the sentence’s meaning would remain the same if they were removed.
- porter_stem – this is an algorithm for stemming. Stemming is the process of reducing a word to its root form. For example, printing will be converted to print, and printed will also be converted to print. This helps further standardize the data.
- Trigrams_filter – this is for the subset pattern matching already discussed above.
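To see what this chain actually produces, you can hit the _analyze API of the index created by the script above. This is a rough sketch, assuming the index exists at http://localhost:9200/wiki_search; the exact request format varies a little between elasticsearch versions (older versions accept the analyzer and text as query parameters, as done here).
import requests

# Run a sample phrase through the custom "stem" analyzer defined in the settings.
analyze_url = 'http://localhost:9200/wiki_search/_analyze'
resp = requests.get(analyze_url, params={'analyzer': 'stem', 'text': 'Restarting the printing press'})

# The returned tokens are lowercased, stopword-free, stemmed n-grams;
# "start" should show up as one of the grams produced from "restarting".
print(resp.json())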
“search_query” will be the property responsible for the autocomplete suggestions, which we will integrate with the UI later. Its type is “completion”, which lets elasticsearch know that this data will be retrieved as we type and that it has to be optimized for speed.
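As a preview of how this field will be used once we have indexed some data into it (Part 3), a completion query against the older _suggest endpoint looks roughly like the sketch below. The request shape differs on newer elasticsearch versions, “wiki_suggest” is just a hypothetical name for the suggestion block, and the results will be empty until data is loaded.
import requests
import json

# Ask the completion suggester for everything starting with the prefix "alb"
# (e.g. "Albert Einstein").
suggest_url = 'http://localhost:9200/wiki_search/_suggest'
body = json.dumps({
    "wiki_suggest": {
        "text": "alb",
        "completion": {"field": "search_query"}
    }
})
resp = requests.post(suggest_url, data=body)
print(resp.json())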
Finally, we create a new elasticsearch index called “wiki_search”. Its URL, http://localhost:9200/wiki_search, is the endpoint through which our UI will call elasticsearch’s RESTful service. The script first deletes any existing index with that name and then creates it with the settings and mappings above.
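To double-check that the index was created the way we wanted, you can read the stored settings and mappings back from elasticsearch; a minimal sketch:
import requests

# Read back the settings and mappings that elasticsearch actually stored,
# to confirm the analyzers and field types were applied as intended.
print(requests.get('http://localhost:9200/wiki_search/_settings').json())
print(requests.get('http://localhost:9200/wiki_search/_mapping').json())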
In the next segment of how to build a search engine, we will look at indexing the data, which will make our search engine practically ready.
Originally posted here