Run update21, to create the cs21/labs/08 directory. Then cd into your cs21/labs/08 directory where you will complete the program in the file webQueries.py.
In this lab you will build a mini web search engine. You will read in a set of web pages based on their URLs. For each page, using linear search, you will build a sorted list of unique words that appear on that page and the number of times that they appear. Then, using binary search, you will respond to queries from the user by summing up the number of times each query word appears on each web page. The more times the query words appear, the more relevant the page. You will report the search results in relevance order, only showing those pages that have at least one occurrence of a query word. Here is a sample run of how your program will work.
In addition, you may optionally choose to create your own
web page on the CS system and add it to the search engine. For
example, a very basic web page has been created for the
user lmas. It has a single sentence which reads "I like to
play ultimate frisbee, soccer, and run track." Once this web page is
added to the web search engine, here is another
sample run demonstrating that this web
page is now being searched as well.
Edit the file webQueries.py where you will implement your mini web search engine. Notice that the file includes several completed functions, some stubbed out functions, and some unit tests for these stubbed out functions. Read through the explanation below and follow the instructions highlighted in blue.
The completed functions:
The main program includes a test of these completed functions:
urls = getURLs() wordList = extractWords(urls[0]) print(wordList)Test these functions now. Replace index 0 with the index of your cs21 instructor in the url list, and then execute the program to see the specific word list generated by your instructor's web page. If you go to your instructor's web page you will see that this list of words matches the words, in order, that appear on their home page (excluding the short and numeric words).
The stubbed out functions:
[["apple", 3], ["banana", 1], ["mango", 2]]Notice that this list is sorted based on the dictionary ordering of the words: "apple" comes before "banana", which comes before "mango". Remember that a single index into such a list will give you the entire sublist:
sortedWordCounts[1] # The sublist ["banana", 1]A double index into such a list will give you a component of the sublist:
sortedWordCounts[1][0] # The string "banana" sortedWordCounts[1][1] # The count 1The purpose of this function is to use linear search to find the correct index for the given word so that it will be placed in sorted order into the existing list. There are two cases you will need to handle:
Implement the findSortedPosition function now. Feel free to use the linear search code that we developed in class as the starting point for this function. You will need to adjust it to work with lists of lists and to stop when you find the correct location for a word that is NOT in the list. Look at the function called unitTests1. This contains test cases showing how this function should work. In the main program, call these unit tests to ensure that your implementation is correct.
Implement the countWords function now. Look at the function called unitTests2. This contains test cases showing how this function should work. In the main program, call these unit tests to ensure that your implementation is correct.
This function uses binary search to look up the given word in a sorted list of words and their counts.
Implement the findWordCount function now. Feel free to use the binary search code that we developed in class as a starting point. You will need to adjust it to work with a list of lists. Look at the function called unitTests3. This contains test cases showing how this function should work. In the main program, call these unit tests to ensure that your implementation is correct.
Once you have completed the stubbed out functions and they are passing the unit tests, you are ready to begin designing the main program and completing any additional functions you may need.
Your program should produce output that looks like the sample run.
Use top-down design to think about the major steps the main program will need to complete (you do not need to send us your design). Then begin building the program incrementally. Here are some tips for how to proceed:
Python has a data structure called a dictionary that would work well for this. A dictionary stores keys and associates a value with them. For example:
d = {} #create an empty dictionary called d d['a'] = 1 #store the key 'a' with the value 1 d['b'] = 2 #store the key 'b' with the value 2 for key in d: #prints the keys and values in d print key, d[key]To keep track of the web page data, you can use a dictionary, where the keys are URLs and the values are sorted word, count lists.
You will need to show the results in sorted order from most relevant to least (be sure to exclude any results where the relevance score is 0). You are welcome to use python's built-in sort() method for lists. For example, if you create a results list structured like this:
[[4, "url0"], [2, "url1"], [0, "url2"], [3, "url3"]]Then calling results.sort() will put the list in ascending order from lowest relevance to highest like this:
[[0, "url2"], [2, "url1"], [3, "url3"], [4, "url0"]]You can then loop backwards through the list to report the best results first.
Here are instructions for how to create your own basic web page to add to the search engine.
cd
mkdir public_html
chmod a+rx public_html
cd public_html
emacs index.html
chmod a+r index.html
www.cs.swarthmore.edu/~yourUserName
webQueries.py
program that
will be used to retrieve web pages for the search engine. Then re-run
the program and try finding information that you included on your web
page.
Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.