How to compare text and find differences in Ruby

Comparing text and finding differences is a common task in many applications, such as text editors, version control systems, and data processing pipelines. Ruby's == operator can tell you whether two strings differ, but not where. In this guide, we will explore a practical approach to locating the differences line by line with the diff-lcs gem.

Quick Example

require 'diff/lcs'

def compare_text(original, modified)
  # sdiff pairs every line of the original with its counterpart in the
  # modified text and returns one Diff::LCS::ContextChange per pair.
  changes = Diff::LCS.sdiff(original.split("\n"), modified.split("\n"))
  changes.map do |change|
    case change.action
    when '=' # unchanged line
      "  #{change.old_element}"
    when '!' # line replaced by a different line
      "- #{change.old_element}\n+ #{change.new_element}"
    when '+' # line present only in the modified text
      "+ #{change.new_element}"
    when '-' # line present only in the original text
      "- #{change.old_element}"
    end
  end.join("\n")
end

original_text = "This is the original text.\nIt has multiple lines."
modified_text = "This is the modified text.\nIt has multiple lines too."

puts compare_text(original_text, modified_text)

This code uses the diff-lcs gem (installed with gem install diff-lcs and loaded as diff/lcs) to compare two strings line by line and produce a human-readable diff output.

Step-by-Step Breakdown

Let's walk through the code line by line:

  1. require 'diff/lcs': We load the diff-lcs gem, which implements a longest common subsequence (LCS) algorithm for computing the differences between two sequences.
  2. def compare_text(original, modified): We define a method compare_text that takes two arguments: original and modified, which are the two strings to be compared.
  3. changes = Diff::LCS.sdiff(original.split("\n"), modified.split("\n")): We split the input strings into arrays of lines using the split method, and then pass these arrays to Diff::LCS.sdiff. It returns an array of Diff::LCS::ContextChange objects, one for each pair of lines, including the unchanged ones. Diff::LCS.diff, by contrast, returns only the changed lines, grouped into hunks of Diff::LCS::Change objects.
  4. changes.map do |change| ... end: We use the map method to turn each ContextChange into a line of human-readable diff output.
  5. case change.action ... end: We use a case statement on the action of each change:
    • '=': The line is identical in both input strings. We print it with two leading spaces so that it lines up with the prefixed lines.
    • '!': The line has been modified. We print the old and new lines with a - and + prefix, respectively.
    • '+': The line has been inserted in the modified string. We print it with a + prefix.
    • '-': The line has been deleted in the modified string. We print it with a - prefix.
  6. end.join("\n"): Finally, we join the formatted lines with newline characters using the join method.
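
To make the idea behind an LCS-based diff concrete, here is a minimal sketch in plain Ruby. It is a simplified illustration and not the diff-lcs implementation; the names lcs_table and simple_line_diff are made up for this example, and the quadratic table makes it suitable only for small inputs.

```ruby
# Builds a table where table[i][j] is the length of the longest common
# subsequence of a[i..] and b[j..].
def lcs_table(a, b)
  table = Array.new(a.size + 1) { Array.new(b.size + 1, 0) }
  (a.size - 1).downto(0) do |i|
    (b.size - 1).downto(0) do |j|
      table[i][j] =
        if a[i] == b[j]
          table[i + 1][j + 1] + 1
        else
          [table[i + 1][j], table[i][j + 1]].max
        end
    end
  end
  table
end

# Walks the table from the start and emits one prefixed line per step.
def simple_line_diff(original, modified)
  a = original.split("\n")
  b = modified.split("\n")
  table = lcs_table(a, b)
  out = []
  i = j = 0
  while i < a.size && j < b.size
    if a[i] == b[j]
      out << "  #{a[i]}"   # line is part of the common subsequence
      i += 1
      j += 1
    elsif table[i + 1][j] >= table[i][j + 1]
      out << "- #{a[i]}"   # dropping the old line keeps the LCS as long
      i += 1
    else
      out << "+ #{b[j]}"   # otherwise the new line is an insertion
      j += 1
    end
  end
  a.drop(i).each { |line| out << "- #{line}" }  # leftover old lines
  b.drop(j).each { |line| out << "+ #{line}" }  # leftover new lines
  out.join("\n")
end

puts simple_line_diff("a\nb\nc", "a\nx\nc")
```

For the three-line sample, the unchanged lines a and c are kept and the middle line is reported as a removal of b followed by an addition of x.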

Handling Edge Cases

Here are some common edge cases to consider:

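Empty/nil input

An empty string is handled without any special code: "".split("\n") returns an empty array, so every line of the other text is reported as added or removed. A nil input, however, raises NoMethodError because nil has no split method. To handle this case, we can convert both inputs with to_s at the beginning of the method, which turns nil into an empty string:

def compare_text(original, modified)
  # nil.to_s is "", so a missing text is treated as an empty one
  original = original.to_s
  modified = modified.to_s
  # ...
end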

Invalid input

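Null bytes are valid inside Ruby strings and need no special treatment, but byte sequences that are invalid in the string's encoding can make string methods raise ArgumentError (invalid byte sequence in UTF-8). The force_encoding method alone does not fix this: it only relabels the bytes, and it modifies the receiver in place. To handle this case, we can tag a copy of each input as UTF-8 and then call scrub, which replaces every invalid sequence with the Unicode replacement character (U+FFFD):

def compare_text(original, modified)
  # dup avoids mutating the caller's strings; scrub replaces invalid bytes
  original = original.dup.force_encoding("UTF-8").scrub
  modified = modified.dup.force_encoding("UTF-8").scrub
  # ...
end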
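
The difference between relabelling bytes and repairing them can be checked with valid_encoding?. The snippet below is a small illustration with a made-up sample string; it passes "?" to scrub only to make the replacement easy to see.

```ruby
# \xFF can never appear in well-formed UTF-8, so this string is invalid.
raw = "caf\xFF latte".dup.force_encoding("UTF-8")

puts raw.valid_encoding?    # false: force_encoding only relabels the bytes

clean = raw.scrub("?")      # replace each invalid sequence with "?"
puts clean.valid_encoding?  # true
puts clean                  # caf? latte
```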

Large input

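If the input strings are very large, the compare_text method may consume a lot of memory and CPU time. Diff::LCS works on indexable sequences such as arrays, not on streams, so wrapping the strings in StringIO and passing each_line enumerators does not help: both arrays of lines have to be in memory. What we can do is skip the diff entirely when the inputs are identical and split each string only once:

def compare_text(original, modified)
  # Identical inputs have no differences to compute
  return "" if original == modified

  original_lines = original.split("\n")
  modified_lines = modified.split("\n")
  diff = Diff::LCS.diff(original_lines, modified_lines)
  # ...
end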
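
When two large texts differ only in a small region, another option is to strip the identical leading and trailing lines before diffing the middle. The sketch below uses a hypothetical helper named trim_common. The library may already perform a similar optimization internally, so treat any speed-up as an assumption to measure rather than a guarantee.

```ruby
# Returns the number of shared leading lines plus the two middle sections
# that still need to be compared.
def trim_common(a, b)
  # Count identical leading lines.
  head = 0
  head += 1 while head < a.size && head < b.size && a[head] == b[head]

  # Count identical trailing lines without overlapping the head.
  tail = 0
  while tail < a.size - head && tail < b.size - head &&
        a[a.size - 1 - tail] == b[b.size - 1 - tail]
    tail += 1
  end

  [head, a[head, a.size - head - tail], b[head, b.size - head - tail]]
end

old_lines = %w[a b c d e]
new_lines = %w[a b X d e]
offset, old_mid, new_mid = trim_common(old_lines, new_lines)
p offset   # 2
p old_mid  # ["c"]
p new_mid  # ["X"]
```

The returned offset is needed to map positions in the trimmed diff back to line numbers in the full text.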

Unicode/special characters

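Tabs and other special characters are compared like any other character and need no special handling. Unicode text is different, because the same visible character can be encoded in more than one way: é can be a single code point (U+00E9) or the letter e followed by a combining acute accent (U+0301). Such lines look identical but are reported as changed. To handle this case, we can normalize both inputs to the same form before computing the differences, using String#unicode_normalize, which is built into Ruby 2.2 and later:

def compare_text(original, modified)
  # NFC composes characters, so both spellings of é become U+00E9
  original = original.unicode_normalize(:nfc)
  modified = modified.unicode_normalize(:nfc)
  # ...
end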
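
The effect of normalization is easy to verify on a single word. The two sample strings below are made up for illustration and render identically on screen.

```ruby
composed   = "caf\u00E9"   # é as a single code point
decomposed = "cafe\u0301"  # e followed by a combining acute accent

puts composed == decomposed  # false: the code points differ

same = composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)
puts same                    # true: both are now in composed form
```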

Common Mistakes

Here are some common mistakes to avoid:

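1. Using == on text that has not been normalized

The == operator compares the contents of two strings exactly. It is the right tool for checking whether two texts are identical, but it reports canonically equivalent Unicode strings (such as a precomposed é and an e followed by a combining accent) as different. Comparing original.chars with modified.chars does not help, because it gives the same answer as ==.

# Wrong when the texts may use different Unicode forms
original == modified

# Correct
original.unicode_normalize(:nfc) == modified.unicode_normalize(:nfc)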

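2. Using a regex pattern in gsub for literal text

When cleaning up text before a comparison, remember that gsub treats its two kinds of pattern differently. A string pattern is matched literally, while a regular expression gives characters such as . and * a special meaning. Both forms replace every occurrence.

# Wrong: the dot matches any character, so "1x0" is replaced as well
original.gsub(/1.0/, "2.0")

# Correct: a string pattern matches only the literal text "1.0"
original.gsub("1.0", "2.0")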

3. Using split without specifying the separator

Calling the split method without an argument splits on runs of whitespace, so the text is broken into words instead of lines and a line-based diff becomes meaningless.

# Wrong
original.split

# Correct
original.split("\n")
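
The choice of splitting method also decides what happens to blank lines at the end of the text. The comparison below uses a made-up sample string to show the behavior of each variant.

```ruby
text = "first line\nsecond line\n\n"

p text.split               # ["first", "line", "second", "line"]
p text.split("\n")         # ["first line", "second line"]
p text.split("\n", -1)     # ["first line", "second line", "", ""]
p text.lines(chomp: true)  # ["first line", "second line", ""]
```

split("\n") silently drops trailing empty lines, so two texts that differ only in their final blank lines compare as equal. Passing a limit of -1, or using lines(chomp: true), preserves them.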

Performance Tips

Here are some performance tips for comparing text and finding differences in Ruby:

1. Skip work you do not need

Diff::LCS needs both sequences in memory, so the input cannot be streamed. The cheapest diff is the one that never runs: a plain == check returns quickly for identical inputs and avoids splitting the strings and running the LCS algorithm at all.

def compare_text(original, modified)
  # Identical inputs have no differences to compute
  return "" if original == modified

  diff = Diff::LCS.diff(original.split("\n"), modified.split("\n"))
  # ...
end

2. Use a caching layer

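Using a caching layer can improve performance when the same pair of texts is compared repeatedly, because the stored result is returned without recomputing the diff. The example below assumes a Memcached server reachable with Dalli's default settings. The client is created once rather than on every call, and the cache key is built from SHA-256 digests because Memcached keys are limited to 250 bytes.

require 'dalli'
require 'diff/lcs'
require 'digest'

CACHE = Dalli::Client.new

def compare_text(original, modified)
  cache_key = "compare_text_#{Digest::SHA256.hexdigest(original)}_#{Digest::SHA256.hexdigest(modified)}"
  # fetch returns the cached value, or runs the block and stores its result
  CACHE.fetch(cache_key) do
    Diff::LCS.diff(original.split("\n"), modified.split("\n"))
  end
end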
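
When no cache server is available, the same idea works inside a single process. The sketch below uses a hypothetical DiffCache class backed by a plain Hash; it never evicts entries, so it is an illustration of the pattern rather than a production cache.

```ruby
require 'digest'

# Hypothetical in-process cache, keyed by digests of both inputs.
class DiffCache
  def initialize
    @store = {}
  end

  # Runs the block only when this pair of texts has not been seen before.
  def fetch(original, modified)
    key = [Digest::SHA256.hexdigest(original), Digest::SHA256.hexdigest(modified)]
    @store.fetch(key) { @store[key] = yield }
  end
end

cache = DiffCache.new
calls = 0
2.times do
  cache.fetch("a\nb", "a\nc") do
    calls += 1
    "- b\n+ c"  # stand-in for an expensive compare_text call
  end
end
puts calls  # 1
```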

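3. Use parallel processing for many comparisons

The parallel gem does not speed up a single diff, but independent comparisons can run side by side. Because MRI threads do not execute Ruby code in parallel, CPU-bound work such as diffing benefits from processes rather than threads.

require 'parallel'
require 'diff/lcs'

# pairs is an array of [original, modified] string pairs
def compare_many(pairs)
  Parallel.map(pairs, in_processes: 2) do |pair|
    original, modified = pair
    Diff::LCS.diff(original.split("\n"), modified.split("\n"))
  end
end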

FAQ

Q: What is the best way to compare text in Ruby?

A: A well-established option is the diff-lcs gem (loaded with require 'diff/lcs'), which computes the differences between two sequences using a longest common subsequence algorithm.

Q: How can I handle empty or null input strings?

A: Empty strings need no special handling. Convert possible nil values with to_s at the beginning of the compare_text method so that nil is treated as an empty string.

Q: How can I handle large input strings?

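A: Return early when the two strings are identical, split each string into lines only once, and cache results for pairs that are compared repeatedly. Diff::LCS needs both sequences in memory, so the input cannot be streamed.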

Q: How can I handle Unicode characters or special characters?

A: Normalize both inputs to the same Unicode form, for example with String#unicode_normalize(:nfc), before computing the differences. Tabs and other special characters need no extra handling.

Q: What are some common mistakes to avoid when comparing text in Ruby?

A: Common mistakes include comparing text with == before normalizing it, using a regex pattern in gsub when a literal match is intended, and calling split without specifying the separator.
