Overview

The project involves implementing three different search mechanisms: a linear search, a hash table search using Python's built-in dict objects, and a hash table search using a custom hash table implementation. All three mechanisms should return the same results, but the linear search is expected to be much slower.

The project also includes creating an index for faster lookups and generating HTML output to display the search results. The data sets used for the project are articles from Slate magazine and Berlitz travelogues.

Dataset

A search engine takes in one or more search terms and looks for files within a corpus that contain all of those terms. A corpus is a collection of text files, often organized within a directory and its subdirectories. For example, the American National Corpus contains a large amount of text data. In this project, articles from Slate magazine and Berlitz travelogues have been extracted for use as data sets.

Methodologies

Linear Search:

The first step in this project is to implement a linear search that checks each file to see if it contains all of the search terms; matching filenames are added to a list. The time complexity is O(kn), where k is the average number of words per file and n is the number of files; treating k as a constant, this is effectively O(n) in the number of files.
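A minimal sketch of this linear scan follows, assuming each file is read into a set of lowercase words; the helper names words and linear_search are placeholders for illustration, not the project's exact API:

def words(text):
    # Lowercase and split on whitespace; a real tokenizer would also strip punctuation.
    return set(text.lower().split())

def linear_search(filenames, terms):
    # Return the filenames whose text contains every search term.
    terms = {t.lower() for t in terms}
    matches = []
    for fname in filenames:
        with open(fname, encoding='utf-8') as f:
            if terms <= words(f.read()):  # subset test: are all terms present?
                matches.append(fname)
    return matches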

Hash Table using dict():

Instead of searching every file for every query, it is more efficient to create an index that maps each word to the files containing it. The results for a multi-word search are found by intersecting the document sets of the individual words. The index stores sets of document indexes rather than filename strings, which avoids duplicating the filenames.

Creating the index takes about the same time as performing one linear search, since both must scan every file; index creation is O(n) for n total words across all files. Once the index is built, however, looking up a word takes roughly constant time, O(1) on average.
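A minimal sketch of the dict-based index, assuming the documents are supplied as a list of word lists so that document indexes, not filename strings, end up in the index (the names create_index and index_search are placeholders, not the project's required interface):

def create_index(files):
    # Map each word to the set of document indexes that contain it.
    index = {}
    for docid, word_list in enumerate(files):
        for w in word_list:
            index.setdefault(w, set()).add(docid)
    return index

def index_search(index, terms):
    # Intersect the per-word document sets; any unknown term yields the empty set.
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()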

Index Search using a custom Hash Table:

Computing hash codes:

The custom hash table supports any hashable object as a key, not just strings and integers, and the mapped values can be of any type. A simple hash code for a string could be the sum of its characters' ASCII codes, but this would result in many collisions. A collision occurs when different keys hash to the same bucket. How evenly elements are distributed across the buckets depends on the number of buckets and the quality of the hash function.
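To see why the ASCII-sum hash is weak: it ignores character order, so any two anagrams collide. A quick illustration (ascii_hash is a hypothetical helper, not part of the project):

def ascii_hash(s):
    # Sum of character codes; order-insensitive, so anagrams collide.
    return sum(ord(c) for c in s)

print(ascii_hash("listen"), ascii_hash("silent"))  # both 655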

I used the built-in hash(o) function to get hash codes for all key objects:

h = hash(key) % len(self.buckets)  # map any hashable key to a bucket index
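For context, here is a minimal sketch of a bucket-based (separate-chaining) hash table built around that line; the class and method names HashTable, put, and get are my own placeholders rather than the project's required interface:

class HashTable:
    def __init__(self, nbuckets=4011):
        # Each bucket is a list of (key, value) pairs (separate chaining).
        self.buckets = [[] for _ in range(nbuckets)]

    def _bucket(self, key):
        # hash() works for any hashable object, not just strings and integers.
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                 # key already present: overwrite value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))      # new key: append to the chain

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None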