Similarity

Analyze a pair of text samples

Notes on the method

I was given the following challenge:
Write a program that takes as inputs two file paths and uses a metric to determine how similar they are. Documents that are exactly the same should get a score of 1, and documents that don’t have any words in common should get a score of 0.

View my solution code here.

For the solution I avoided libraries and known algorithms and wrote a brute-force answer. I reviewed the main strategy I learned at Hunter College for DNA testing, the Needleman-Wunsch algorithm, but set it aside, mostly because I couldn't quickly figure out how to implement it. So my naive strategy is this:

  1. Take the two input strings, strip punctuation and any other non-alphabetic characters, and split on spaces to create a pair of word arrays.
  2. Tally any members of List 1 that do not occur in List 2.
  3. Repeat the orphan tally in the reverse direction, tallying any members of List 2 that do not occur in List 1.
  4. Return a similarity score: take the inverse (1 minus) of the average of orphans/allwords. So on a scale of 0 to 1, 1 represents complete identity and 0 represents utter dissimilarity.
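
The four steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the linked solution code: the function names (`words`, `orphan_ratio`, `similarity`, `similarity_of_files`) are hypothetical, lowercasing before comparison is an assumption, "average of orphans/allwords" is read as the mean of the two per-list orphan ratios, and "inverse" is read as 1 minus that mean.

```python
import re
from pathlib import Path


def words(text):
    # Step 1: lowercase (an assumption), drop every character that is not
    # a letter or whitespace, then split on whitespace into a word list.
    return re.sub(r"[^a-z\s]", "", text.lower()).split()


def orphan_ratio(source, other):
    # Steps 2 and 3: fraction of words in `source` that never occur in `other`.
    if not source:
        # Assumed convention: two empty lists match; empty vs non-empty does not.
        return 0.0 if not other else 1.0
    other_set = set(other)
    orphans = sum(1 for word in source if word not in other_set)
    return orphans / len(source)


def similarity(text_a, text_b):
    # Step 4: 1 minus the average of the two orphan ratios.
    a, b = words(text_a), words(text_b)
    average = (orphan_ratio(a, b) + orphan_ratio(b, a)) / 2
    return 1 - average


def similarity_of_files(path_a, path_b):
    # Hypothetical wrapper for the "two file paths" form of the challenge.
    text_a = Path(path_a).read_text(encoding="utf-8")
    text_b = Path(path_b).read_text(encoding="utf-8")
    return similarity(text_a, text_b)
```

Under these assumptions, identical texts score 1 and texts with no shared words score 0. Because repeated words are tallied individually, "foo foo foo" versus "FOO BAR BAZ" has no orphans in the first list and two out of three in the second, which works out to about 0.67.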

I did not implement a form for a user to enter text live. The grey boxes below are populated with hard-coded data as this page loads.

Analyze A versus B:

Text A

12345

Text B

12345

Similarity

12345

Analyze B versus C:

Text B

12345

Text C

12345

Similarity

12345

Analyze A versus C:

Text A

12345

Text C

12345

Similarity

12345

Analyze C versus C (should give perfect score):

Text C

12345

Text C

12345

Similarity

12345

Analyze A versus 'FOO BAR BAZ':

Text A

12345

Text

12345

Similarity

12345

Analyze "foo foo foo" versus "FOO BAR BAZ":

Text

12345

Text

12345

Similarity

12345

Discussion

Some satisfying improvements to this project could include the following: