Exforsys

Automatic language detection

This is a discussion on Automatic language detection within the Software Patterns forums, part of the Testing category.

  1. #1
    Marco Barulli Guest

    Automatic language detection

    I'm looking for tools/libraries/... to automatically detect the
    language (e.g. English, French, Italian, German, ...) of a given text
    document.

    This tool needs to be applied to a very large set of text documents,
    hence efficiency is not a trivial part of the solution.

    Many thanks for your help or suggestions,
    Marco



  2. #2
    Richard Owlett Guest

    Re: Automatic language detection

    Marco Barulli wrote:

    > I'm looking for tools/libraries/... to automatically detect the
    > language (e.g. English, French, Italian, German, ...) of a given text
    > document.
    >
    > This tool needs to be applied to a very large set of text documents,
    > hence efficiency is not a trivial part of the solution.
    >
    > Many thanks for your help or suggestions,
    > Marco


    I think you've left out a *major* portion of your problem definition.

    This forum [comp.speech.research] generally seems to deal with "voice
    recognition".
    As stated, your problem seems to deal with printed/printable text.

    Before making a fool of myself, I'll let you respond.

    You REALLY should set a 'follow up'.






  3. #3
    Marco Barulli Guest

    Re: Automatic language detection

    Richard Owlett <rowlett@atlascomm.net> wrote in message news:<10er3r3re462k3c@corp.supernews.com>...
    >
    > I think you've left out a *major* portion of your problem definition.
    >
    > This forum [comp.speech.research] generally seems to deal with "voice
    > recognition".
    > As stated, your problem seems to deal with printed/printable text.


    Yes, I'm talking about printable text, more specifically a large set
    of plain Unicode documents.

    I thought that comp.speech.research could be a suitable place to post
    my inquiry since I guess that, in a text-to-speech process, you might
    need to detect the language of the text in order to pronounce it
    correctly.

    Thanks for your help,
    Marco



  4. #4
    Tristan Miller Guest

    Re: Automatic language detection

    Greetings.

    In article <7f2f3b74.0407072339.c5d2700@posting.google.com>, Marco Barulli
    wrote:
    > I'm looking for tools/libraries/... to automatically detect the
    > language (e.g. English, French, Italian, German, ...) of a given text
    > document.
    >
    > This tool needs to be applied to a very large set of text documents,
    > hence efficiency is not a trivial part of the solution.
    >
    > Many thanks for your help or suggestions,


    No cites, but I have heard that simple statistical bigraph and trigraph
    models are very effective. If you have a limited number of languages
    which you can personally identify, then it shouldn't be difficult for you
    to produce the necessary data yourself.

    Try a Google search for "language detection" and "bigraph"; you should turn
    up some papers.
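
As a rough illustration of the trigraph idea, here is a minimal Python sketch. Everything in it is an assumption made for the example, not taken from any cited paper or product: the function names, the tiny hand-written training samples, and the choice of cosine similarity to compare trigraph counts. Real use would need far larger samples per language.

```python
from collections import Counter
import math


def trigraph_profile(text, n=3):
    """Count overlapping character n-graphs (trigraphs by default)."""
    # Lowercase and collapse whitespace so layout does not affect the counts.
    text = " ".join(text.lower().split())
    padded = f" {text} "  # pad so word boundaries at the ends are counted
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two Counter-based profiles."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical, deliberately tiny training samples (assumption for this sketch).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and this is "
               "a simple text written in the english language",
    "french": "le renard brun rapide saute par-dessus le chien paresseux et "
              "ceci est un texte simple écrit dans la langue française",
    "italian": "la volpe marrone veloce salta sopra il cane pigro e questo è "
               "un testo semplice scritto nella lingua italiana",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "dies ist ein einfacher text in der deutschen sprache",
}

# Build each language profile once; only the per-document profile is
# recomputed, which matters when scanning a large document set.
PROFILES = {lang: trigraph_profile(sample) for lang, sample in TRAINING.items()}


def detect_language(text):
    """Return the training language whose profile is most similar to text."""
    profile = trigraph_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))


if __name__ == "__main__":
    for doc in ("the lazy dog is in the text", "der hund ist faul"):
        print(doc, "->", detect_language(doc))
```

Since the language profiles are computed only once, the per-document cost is roughly linear in the document length, which should help with a very large document set. For long documents, profiling only a leading portion of the text may be enough, though that is a guess rather than a measured result.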

    Regards,
    Tristan

    --
    _
    _V.-o Tristan Miller [en,(fr,de,ia)] >< Space is limited
    / |`-' -=-=-=-=-=-=-=-=-=-=-=-=-=-=-= <> In a haiku, so it's hard
    (7_\\ http://www.nothingisreal.com/ >< To finish what you



  5. #5
    Jerry Wolf Guest

    Re: Automatic language detection

    Using the search term "language identification" instead of "detection"
    will get you a lot, and combining it with "bigraph" will get you a much
    more focused set of citations.

    Tristan Miller <psychonaut@nothingisreal.com> wrote in message news:<1886733.TpXR0R6JqR@ID-187157.News.Individual.NET>...
    > Greetings.
    >
    > In article <7f2f3b74.0407072339.c5d2700@posting.google.com>, Marco Barulli
    > wrote:
    > > I'm looking for tools/libraries/... to automatically detect the
    > > language (e.g. English, French, Italian, German, ...) of a given text
    > > document.
    > >
    > > This tool needs to be applied to a very large set of text documents,
    > > hence efficiency is not a trivial part of the solution.
    > >
    > > Many thanks for your help or suggestions,

    >
    > No cites, but I have heard that simple statistical bigraph and trigraph
    > models are very effective. If you have a limited number of languages
    > which you can personally identify, then it shouldn't be difficult for you
    > to produce the necessary data yourself.
    >
    > Try a Google search for "language detection" and "bigraph"; you should turn
    > up some papers.
    >
    > Regards,
    > Tristan




  6. #6
    Marco Barulli Guest

    Re: Automatic language detection

    Thanks to all for the very kind, prompt and effective help!
    Marco



  7. #7
    Art Pollard Guest

    Re: Automatic language detection


    "Marco Barulli" <marco.barulli@gmail.com> wrote in message
    news:7f2f3b74.0407072339.c5d2700@posting.google.com...
    > I'm looking for tools/libraries/... to automatically detect the
    > language (e.g. English, French, Italian, German, ...) of a given text
    > document.


    You may want to check out:

    http://www.lextek.com/
    http://www.languageidentifier.com/

    Of course, since I work for Lextek, I'm a bit biased....

    -Art

    --
    Lextek International
    http://www.lextek.com/
    Suppliers of High Performance Text Search and Retrieval Engines




