Similarity

Analyze a pair of text samples

Notes on the method

I was given the following challenge:
Write a program that takes as inputs two file paths and uses a metric to determine how similar they are. Documents that are exactly the same should get a score of 1, and documents that don’t have any words in common should get a score of 0.

For the solution I avoided libraries and known algorithms and wrote a naive, back-of-the-envelope solution. I did review the main strategy I learned at Hunter College for comparing DNA sequences, the Needleman-Wunsch algorithm, but I set it aside, mostly because I couldn't quickly figure out how to implement it. So my naive strategy is this:

  1. Take the two input strings, strip punctuation and any other non-alphabetic characters, and split on whitespace to create a pair of word arrays.
  2. Tally any members of List 1 that do not occur in List 2 (the orphans).
  3. Repeat the orphan tally in the reverse direction, tallying any members of List 2 that do not occur in List 1.
  4. Return a similarity score: take the complement (1 minus) of the average of orphans/allwords. So on a scale of 0 to 1, 1 represents complete identity and 0 represents utter dissimilarity.
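The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the repository: the function names (to_words, count_orphans, similarity) are hypothetical, lower-casing the words is an assumption, and "average of orphans/allwords" is read here as all orphans divided by all words across both lists. Reading the two files from their paths is left out, so the functions take strings directly.

```python
import re


def to_words(text: str) -> list[str]:
    """Step 1: keep only letters and whitespace, then split into words.

    Lower-casing is an assumption; the write-up does not say whether case matters.
    """
    letters_only = re.sub(r"[^A-Za-z\s]", "", text)
    return letters_only.lower().split()


def count_orphans(words: list[str], other: list[str]) -> int:
    """Steps 2 and 3: count the words in `words` that never occur in `other`."""
    other_set = set(other)
    return sum(1 for word in words if word not in other_set)


def similarity(text_a: str, text_b: str) -> float:
    """Step 4: one minus the share of orphaned words, pooled over both lists."""
    words_a, words_b = to_words(text_a), to_words(text_b)
    total = len(words_a) + len(words_b)
    if total == 0:
        return 1.0  # assumption: two empty documents count as identical
    orphans = count_orphans(words_a, words_b) + count_orphans(words_b, words_a)
    return 1.0 - orphans / total
```

With this sketch, two identical texts score 1.0 and two texts with no words in common score 0.0, which matches the two anchor points in the challenge. Repeated words are each counted, so a text that repeats one shared word many times still scores well against a longer text.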

To see my solution in detail, check out its repository. To see the code running with comments, open the console of your browser, and then reload this page.

The page runs the following comparisons, showing the two text samples and the resulting similarity score for each:

  1. A versus B
  2. B versus C
  3. A versus C
  4. C versus C (should give a perfect score)
  5. A versus 'FOO BAR BAZ'
  6. "foo foo foo" versus "FOO BAR BAZ"
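As a worked illustration of the last comparison, assume (this is not stated on the page) that words are compared case-insensitively and that each repeated word is counted. Then "foo foo foo" has no orphans among its 3 words, while "FOO BAR BAZ" has 2 orphans (BAR and BAZ) among its 3 words, and the score from step 4 works out as:

```latex
\[
\text{score} = 1 - \frac{1}{2}\left(\frac{0}{3} + \frac{2}{3}\right)
             = 1 - \frac{1}{3}
             = \frac{2}{3} \approx 0.67
\]
```

Because both lists have 3 words, pooling the orphans over all 6 words gives the same value. If case were significant instead, every word would be an orphan and the score would be 0.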

Discussion

Some satisfying improvements to this project could include the following: