Similarity

Notes on the method

I was given the following challenge:
Write a program that takes as inputs two file paths and uses a metric to determine how similar they are. Documents that are exactly the same should get a score of 1, and documents that don’t have any words in common should get a score of 0.

View my solution code here.

For the solution I avoided libraries and known algorithms and wrote a brute-force answer. I reviewed the main strategy I learned at Hunter College for DNA testing, the Needleman-Wunsch algorithm, but did not use it, mostly because I couldn't quickly figure out how to implement it. So my naive strategy is this:

Take the two input strings and remove punctuation, extra spaces, and any non-alphabetic characters, to create a pair of word arrays (List 1 and List 2).
Tally any members of List 1 that do not occur in List 2. These are the orphans of List 1.
Repeat the orphan tally in the reverse direction, tallying any members of List 2 that do not occur in List 1.
Return a similarity score: take the inverse (one minus) of the average of orphans/allwords. So on a scale of 0 to 1, 1 represents complete identity and 0 represents utter dissimilarity, as the challenge requires.
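
The steps above can be sketched roughly as follows. This is a minimal illustration, not the actual solution code: the function names (`toWords`, `countOrphans`, `similarity`) are hypothetical, lowercasing the input is an assumption, and "average of orphans/allwords" is read here as the mean of the two per-list orphan fractions, which is only one possible interpretation.

```javascript
// Hypothetical helpers sketching the naive strategy; not the author's code.

// Split text into words, dropping punctuation and non-alphabetic characters.
// Assumption: words are lowercased so that "Foo" and "foo" match.
// Note: this simple pattern also drops accented and non-ASCII letters.
function toWords(text) {
  return text
    .toLowerCase()
    .replace(/[^a-z\s]/g, "")
    .split(/\s+/)
    .filter((w) => w.length > 0);
}

// Count the members of `words` that never occur in `others`.
function countOrphans(words, others) {
  const seen = new Set(others);
  return words.filter((w) => !seen.has(w)).length;
}

// Returns 1 when every word is shared and 0 when no word is shared.
function similarity(textA, textB) {
  const a = toWords(textA);
  const b = toWords(textB);
  // Assumption: two empty texts count as identical.
  if (a.length === 0 && b.length === 0) return 1;
  // Assumption: an empty list is treated as fully orphaned.
  const fracA = a.length ? countOrphans(a, b) / a.length : 1;
  const fracB = b.length ? countOrphans(b, a) / b.length : 1;
  return 1 - (fracA + fracB) / 2;
}

console.log(similarity("the cat sat", "the cat sat")); // 1
console.log(similarity("alpha beta", "gamma delta")); // 0
```

Under these assumptions the score ignores word order and word counts, so "foo foo foo" versus "FOO BAR BAZ" comes out at about 0.67: none of the first text's words are orphans, while two of the second text's three words are.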

I did not implement a form page for a user to enter text live. The grey boxes below are populated by hard-coded data as this page loads.

Discussion

Some satisfying improvements to this project could include the following:

Rewrite the logic to follow the Needleman-Wunsch algorithm, common in gene alignment + BLAST.
Use a weighting in the score, so that words of fewer than 5 letters count as less important. Misaligned cases of "of", "the", and "a" are then penalized less than "smooth", "pelican", and "coelacanth". This is not great, since unusual short words like "gnat", "gnu", and "ax" also become de-emphasized content in the text.
Use some sort of library for the weighting instead of word length.
Add buttons to choose from possible texts to align.
Sanitize inputs; allow user-pasted or uploaded text.
Deploy from a Node.js server.
Write with React.js.
Clean up the CSS borrowed from my old page.
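
The word-length weighting idea from the list above might look something like this. It is only a sketch under stated assumptions: the weights (0.5 for short words, 1 for the rest) and the names `wordWeight`, `weightedOrphanFraction`, and `weightedSimilarity` are hypothetical, and the functions expect word arrays that have already been cleaned and are not empty.

```javascript
// Hypothetical weights: the notes only say short words should count less.
const SHORT_WORD_WEIGHT = 0.5;
const LONG_WORD_WEIGHT = 1;

// Words of fewer than 5 letters get the lower weight.
function wordWeight(word) {
  return word.length < 5 ? SHORT_WORD_WEIGHT : LONG_WORD_WEIGHT;
}

// Weighted share of `words` that never occur in `others`.
// Returns 0 when `words` is empty.
function weightedOrphanFraction(words, others) {
  const seen = new Set(others);
  let orphanWeight = 0;
  let totalWeight = 0;
  for (const w of words) {
    const weight = wordWeight(w);
    totalWeight += weight;
    if (!seen.has(w)) orphanWeight += weight;
  }
  return totalWeight === 0 ? 0 : orphanWeight / totalWeight;
}

// One minus the mean of the two weighted orphan fractions.
function weightedSimilarity(wordsA, wordsB) {
  const fracA = weightedOrphanFraction(wordsA, wordsB);
  const fracB = weightedOrphanFraction(wordsB, wordsA);
  return 1 - (fracA + fracB) / 2;
}

// Unweighted, this pair would score 0.5; here the short mismatch costs less.
console.log(weightedSimilarity(["the", "pelican"], ["a", "pelican"]));
```

The drawback noted in the list shows up directly: a mismatch between "gnat" and "gnu" is discounted in exactly the same way as a mismatch between "the" and "a".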

Analyze a pair of text samples

Analyze A versus B:

Text A

Text B

Similarity

Analyze B versus C:

Text B

Text C

Similarity

Analyze A versus C:

Text A

Text C

Similarity

Analyze C versus C (should give perfect score):

Text C

Text C

Similarity

Analyze A versus 'FOO BAR BAZ':

Text A

Similarity

Analyze "foo foo foo" versus "FOO BAR BAZ":

Text

Text

Similarity
