This post is the third part of the multi-part series on how to build a search engine –
- How to build a search engine – Part 1: Installing the tools and getting the technology stack ready
- How to build a search engine – Part 2: Configuring elasticsearch
- How to build a search engine – Part 3: Indexing elasticsearch
- How to build a search engine – Part 4: Building the front end
Here's a sneak peek at what the final output is going to look like –
In case you are in a hurry, you can find the full code for the project on my GitHub page.
Assuming the dataset is named “people_wiki.csv”, place the code below in a new .py file (say, indexing.py) in the same folder as the data.
import json
import time

import pandas as pd
from elasticsearch import Elasticsearch

# Client pointed at the local elasticsearch instance
es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

start_time = time.time()

# Load the dataset; every row describes one person
data = pd.read_csv('people_wiki.csv')
print('Data prepared in ' + str((time.time() - start_time) / 60) + ' minutes')

# Convert the dataframe to a JSON string keyed by row index,
# then parse it back into a dict we can iterate over
json_body = data.reset_index().to_json(orient='index')
json_parsed = json.loads(json_body)
print(data.shape)

# Index every row as a separate document, with the URI as the document id
for element in json_parsed:
    data_json = json_parsed[element]
    id_ = data_json['URI']
    es.index(index='wiki_search', doc_type='data', id=id_, body=data_json)
    print(id_ + ' indexed successfully')

print('Indexed in ' + str((time.time() - start_time) / 60) + ' minutes')
Executing this script will produce a stream of logs as the data gets indexed into elasticsearch. That’s how easy it is!
Let’s spend the next few lines on what actually happened.
We first declare an elasticsearch client object pointing at the instance running on our local machine. Once that object is initialized, we use it to index all of our data.
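If you want to verify the connection before indexing anything, the client exposes a ping() method that returns True when the cluster is reachable. A minimal sketch:

# Fail fast if elasticsearch is not up yet
if not es.ping():
    raise RuntimeError('Cannot reach elasticsearch on localhost:9200 - is it running?')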
Pandas is a Python library for loading and working with datasets, and it works great. We use read_csv() to load our data. Once that is done, we have to convert it to JSON to send it to elasticsearch for indexing. We do the conversion with pandas’ to_json() and the json library that ships with the Anaconda distribution.
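To make the conversion concrete, here is what to_json(orient='index') produces on a tiny made-up dataframe (the columns here are hypothetical, not the real people_wiki.csv schema):

import pandas as pd

toy = pd.DataFrame({'name': ['Alice', 'Bob'], 'text': ['a physicist', 'a painter']})
print(toy.reset_index().to_json(orient='index'))
# {"0":{"index":0,"name":"Alice","text":"a physicist"},
#  "1":{"index":1,"name":"Bob","text":"a painter"}}

Each row becomes one JSON object keyed by its row index, which is exactly the shape we loop over when indexing.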
Every person in our data will be treated as a separate document. We specify the URI as the id of each document, since we can be sure the URI is always unique; indexing again with the same id simply overwrites the existing document instead of creating a duplicate.
Once all this is in place, we can call es.index() with the specified parameters to index our documents one at a time.
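As a side note, indexing one document per HTTP request gets slow for big datasets. The elasticsearch client also ships a bulk helper that sends documents in batches; a rough sketch, reusing the json_parsed dict from the script above:

from elasticsearch import helpers

actions = [
    {
        '_index': 'wiki_search',
        '_type': 'data',       # matches the doc_type used above
        '_id': doc['URI'],     # same id scheme: the unique URI
        '_source': doc,
    }
    for doc in json_parsed.values()
]
helpers.bulk(es, actions)  # sends the documents in large batches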
Indexing is the process elasticsearch goes through to understand the data beforehand. It parses the free text, does all the pre-processing already discussed in Part 2, and stores the data in shards to achieve blazing fast speeds at query time.
Once this script has finished running, the elasticsearch side will be completely ready. From here on, the search engine itself is built; we just have to put a frontend on top using AngularJS, and all the awesomeness of elasticsearch will be accessible through it.
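Before moving on, you can sanity-check the index straight from Python. The snippet below assumes people_wiki.csv has a free-text column named 'text' – swap in whichever field you want to search on:

res = es.search(index='wiki_search', body={'query': {'match': {'text': 'physicist'}}})
print(res['hits']['total'])          # matching document count (an object in newer elasticsearch versions)
for hit in res['hits']['hits'][:3]:  # peek at the top few results
    print(hit['_id'])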
In the next part we will concentrate on building the front end, and with that we will be done with how to build a search engine – fuzzy and blazing fast.