I was given the following challenge: Write a program that takes as inputs two file paths and uses a metric to determine how similar they are. Documents that are exactly the same should get a score of 1, and documents that don’t have any words in common should get a score of 0.
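To make the two boundary conditions concrete, here is a minimal sketch of one metric that satisfies them: Jaccard similarity over the sets of distinct words. This is an illustration of the requirement only, not the brute-force approach described in this post; the function names `words`, `jaccard_similarity`, and `file_similarity` are made up for the example, and treating two empty documents as identical is an assumption.

```python
import re
from pathlib import Path


def words(text: str) -> set[str]:
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Shared distinct words divided by all distinct words, in [0, 1]."""
    a, b = words(text_a), words(text_b)
    if not a and not b:
        return 1.0  # assumption: two empty documents count as identical
    return len(a & b) / len(a | b)


def file_similarity(path_a: str, path_b: str) -> float:
    """Read both files as UTF-8 text and compare their word sets."""
    text_a = Path(path_a).read_text(encoding="utf-8")
    text_b = Path(path_b).read_text(encoding="utf-8")
    return jaccard_similarity(text_a, text_b)
```

Identical texts share every word, so the ratio is 1; texts with no words in common have an empty intersection, so the ratio is 0. Note that this set-based version ignores word order and repetition.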
View my solution code here.
For the solution I avoided libraries and known algorithms and wrote a brute-force answer. I reviewed the main strategy I learned at Hunter College for DNA testing, the Needleman-Wunsch algorithm, but set it aside, mostly because I couldn't quickly figure out how to implement it. So my naive strategy is this:
I did not implement a form page for a user to enter text live. The grey boxes below are populated with hard-coded data as this page loads.
Some satisfying improvements to this project could include the following:
Weighting words so that of, the, and 'a' are penalized less than smooth, pelican, and coelacanth. This is not great, since strange members like gnat, gnu, and ax now become deemphasized content in the text.
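The trade-off can be seen in a small sketch. It assumes the weight of a word grows with its length, which is only one way such a weighting could work; `length_weight`, `weighted_similarity`, and the cap of 6 letters are hypothetical choices for illustration and are not taken from my solution code.

```python
import re


def length_weight(word: str, cap: int = 6) -> float:
    """Hypothetical weight: grows with word length, capped at 1.0."""
    return min(len(word), cap) / cap


def weighted_similarity(text_a: str, text_b: str) -> float:
    """Weighted share of common words; 1.0 if identical, 0.0 if disjoint."""
    a = set(re.findall(r"[a-z']+", text_a.lower()))
    b = set(re.findall(r"[a-z']+", text_b.lower()))
    union = a | b
    if not union:
        return 1.0  # assumption: two empty documents count as identical
    shared = sum(length_weight(w) for w in a & b)
    total = sum(length_weight(w) for w in union)
    return shared / total
```

Under this scheme "the" and "gnu" both get a weight of 0.5 while "pelican" gets 1.0, so a rare short word is discounted just as much as a filler word.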