Fast and Easy Levenshtein distance using a Trie (2011)

(stevehanov.ca)

101 pontos | por sebg 19 dias atrás

12 comentários

  • Mithriil
    15 dias atrás
    The Google's n-gram dataset link is outdated. You can get them here: https://storage.googleapis.com/books/ngrams/books/datasetsv3...
  • pmarreck
    14 dias atrás
    I once wrote an algorithm to generate a regex to find all matches from a given word with a levenshtein distance of 1 (I did not permute it beyond that though). Can link it if someone is curious.
  • kristianp
    15 dias atrás
    The author also has an interesting discussion of Succinct Data Structures at https://stevehanov.ca/blog/succinct-data-structures-cramming...
  • dvh
    15 dias atrás
    I made myself plugin that shows new news in wikipedia's current event page and I was using levenshtein originally (they often edit couple of words in article over span of few days so I compare each new article with previous ones for similarity) but after few days it became too slow (~20s) because O(m*n), so I switched to sorensen-dice instead which is O(m+n) and it's much faster and works very similar way, even tho they do slightly different thing.
  • localhoster
    15 dias atrás
    This article surface every once in a while, and I love it. What the author suggests is very clever. I have implemented an extended version of that in Go as an experiment. Instead of using a trie, I used a radix tree. Functions the same, but it's much more compressed (and faster).
    • pstuart
      15 dias atrás
      care to share the repo or is it private?
  • kelseydh
    15 dias atrás
    I needed a fuzzy string matching algorithm for finding best name matches among a candidate list. Considered Normalized Levenshtein Distance but ended up using Jaro-Winkler. I'm curious if anybody has good resources on when to use each fuzzy string matching algorithm and when.
    • vintermann
      15 dias atrás
      Levenshtein distance is rarely the similarity measure you need. Words usually mean something, and it's usually the distance in meaning you need.

      As usual, examples from my genealogy hobby: many sites allow you to upload your family tree as a gedcom file and compare it to other people's trees or a public tree. Most of these use Levenshtein distance on names to judge similarity, and it's terrible. Anne Nilsen and Anne Olsen could be the same person, right? No!! These tools are unfortunately useless to me because they give so many false positives.

      These days, an embedding model is the way to go. Even a small, bad embedding model is better than Levenshtein distance if you care about the meaning of the string.

      • jppittma
        15 dias atrás
        It depends on if or not you're trying to correct for typos, or do something semantic. Also, embedding distance is much much more expensive.
    • RobinL
      15 dias atrás
      There's a section in the docs of our FOSS record linkage software that covers this: https://moj-analytical-services.github.io/splink/topic_guide...
    • leeoniya
      15 dias atrás
      Levenshtein distance is often a poor way to fuzzy match or rank. i suspect that in js, even the trie approach would incur significant GC/alloc thrashing or cost of building a huge trie index.

      i tried fuzzy matching using a cleverly-assembled regexp approach which works surprisingly well: https://github.com/leeoniya/uFuzzy

      • srean
        15 dias atrás
        I would argue the opposite, with the 'often' doing some heavy lifting.

        It is very likely that you have interacted with a Levenstein distance based spell corrector (with many modifications) and I have touched that code. Used well they can be very powerful.

  • krishnakantk876
    14 dias atrás
    Tries are great when your dictionary is fixed, but for dynamic datasets or finding similar pairs, BK-trees are often faster since they use the triangle inequality to skip most comparisons.
  • gregman1
    15 dias atrás
    2011
  • fergie
    15 dias atrás
    Very cool and satisfying.
  • consomida
    15 dias atrás
    Using a trie to calculate Levenshtein distance is such a clever optimization. Clear explanation and practical examples make it easy to understand and implement
  • sminchev
    15 dias atrás
    A few years ago, before the AI boom I needed to create a de-duplication app, as a PoC. To be able to compare fast millions of contact data and to search for the duplicates. The clients' approach was taking, in best case, a day to compare everything and generate a report.

    What we do was a combination of big data engine, like Apache Spark, a few comparison algorithms like Levenshtein, and ML. AI was not treated as an option to do such things at all! :)

    What we did was to use Apache Spark to apply the static algorithms, if we get confident results like less than 10% equality or more than 90% of equality, we treated those as sure signs for records be duplicated or not. Records that were somewhere in the middle, we sent to Machine Learning libraries for analysis. Of course some education was needed for statistical basis. And hard to be automatically analyzed, we placed in a report for human touch ;)

    We got relatively good results. It was a Scala based app, as far as I remember :)

    Now with AI, it is much more easy... And boring! :D No complexities, no challenges.

    • arnorhs
      15 dias atrás
      That's an interesting story, but I'm really at a loss for how this relates to the post you are commenting on.
      • sminchev
        15 dias atrás
        Thank you for your question. At least there is one person who shares opinion, when down-voting. This is good, because I know what I did wrong, and I highly respect any respectable feedback ;)

        I really hate, when people down-vote, without giving any feedback what they don't like.

        Levenshtein, in combination with Machine Learning and big data engines, like Apache Sparks, can do a good job comparing content as well ;)

        Wanted to share another approach, and ideas to people who are interested in comparing strings, doing fuzzy searches, and searching for duplicated content.

      • devmor
        15 dias atrás
        I think they just forgot to link their train of thought. I have also used Levenshtein distance for deduplication comparisons so I can guess where the story came from.