
Automatic language detection
This is a discussion on Automatic language detection within the Software Patterns forums, part of the Testing category.
-
07-08-2004, 03:39 AM #1 Marco Barulli Guest
Automatic language detection
I'm looking for tools/libraries/... to automatically detect the
language (e.g. English, French, Italian, German, ...) of a given text
document.
This tool needs to be applied to a very large set of text documents,
hence efficiency is not a trivial part of the solution.
Many thanks for your help or suggestions,
Marco
-
07-08-2004, 02:19 PM #2 Richard Owlett Guest
Re: Automatic language detection
Marco Barulli wrote:
> I'm looking for tools/libraries/... to automatically detect the
> language (e.g. English, French, Italian, German, ...) of a given text
> document.
>
> This tool needs to be applied to a very large set of text documents,
> hence efficiency is not a trivial part of the solution.
>
> Many thanks for your help or suggestions,
> Marco
I think you've left out a *major* portion of your problem definition.
This forum [comp.speech.research] generally seems to deal with "voice
recognition".
As stated, your problem seems to deal with printed/printable text.
Before making a fool of myself, I'll let you respond.
You REALLY should set a 'follow up'.
-
07-09-2004, 03:05 AM #3 Marco Barulli Guest
Re: Automatic language detection
Richard Owlett <rowlett@atlascomm.net> wrote in message news:<10er3r3re462k3c@corp.supernews.com>...
>
> I think you've left out a *major* portion of your problem definition.
>
> This forum [comp.speech.research] generally seems to deal with "voice
> recognition".
> As stated, your problem seems to deal with printed/printable text.
Yes, I'm talking about printable text, more specifically a large set
of plain Unicode documents.
I thought that comp.speech.research could be a suitable place to post
my inquiry since I guess that, in a text-to-speech process, you might
need to detect the language of the text in order to pronounce it
correctly.
Thanks for your help,
Marco
-
07-09-2004, 05:03 AM #4 Tristan Miller Guest
Re: Automatic language detection
Greetings.
In article <7f2f3b74.0407072339.c5d2700@posting.google.com>, Marco Barulli
wrote:
> I'm looking for tools/libraries/... to automatically detect the
> language (e.g. English, French, Italian, German, ...) of a given text
> document.
>
> This tool needs to be applied to a very large set of text documents,
> hence efficiency is not a trivial part of the solution.
>
> Many thanks for your help or suggestions,
No cites, but I have heard that simple statistical bigraph and trigraph
models are very effective. If you have a limited number of languages
which you can personally identify, then it shouldn't be difficult for you
to produce the necessary data yourself.
Try a Google search for "language detection" and "bigraph"; you should turn
up some papers.
Regards,
Tristan
--
_
_V.-o Tristan Miller [en,(fr,de,ia)] >< Space is limited
/ |`-' -=-=-=-=-=-=-=-=-=-=-=-=-=-=-= <> In a haiku, so it's hard
(7_\\ http://www.nothingisreal.com/ >< To finish what you
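To illustrate the bigraph/trigraph idea suggested above, here is a minimal Python sketch. It is not taken from any tool named in this thread. It assumes character bigram and trigram counts as the language profile and cosine similarity as the comparison, which is only one of several reasonable choices. The names `ngram_counts`, `cosine`, `train`, `identify` and the tiny `SAMPLES` texts are all made up for the example; a usable system would need far larger training texts per language.

```python
from collections import Counter
import math


def ngram_counts(text, sizes=(2, 3)):
    """Count character bigrams and trigrams in lowercased, space-padded text."""
    padded = " " + " ".join(text.lower().split()) + " "
    counts = Counter()
    for n in sizes:
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def train(samples):
    """Build one n-gram profile per language from sample text."""
    return {lang: ngram_counts(text) for lang, text in samples.items()}


def identify(text, profiles):
    """Return the language whose profile is most similar to the text."""
    doc = ngram_counts(text)
    return max(profiles, key=lambda lang: cosine(doc, profiles[lang]))


# Toy training data (hypothetical, far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and this is a "
               "simple sentence written in the english language for the "
               "purpose of testing what we have here",
    "french": "le renard brun saute par dessus le chien paresseux et ceci "
              "est une phrase simple écrite dans la langue française pour "
              "les besoins du test que nous avons ici",
    "italian": "la volpe marrone salta sopra il cane pigro e questa è una "
               "frase semplice scritta nella lingua italiana per lo scopo "
               "della prova che abbiamo qui",
    "german": "der schnelle braune fuchs springt über den faulen hund und "
              "dies ist ein einfacher satz in der deutschen sprache für den "
              "zweck der prüfung die wir hier haben",
}

profiles = train(SAMPLES)
print(identify("dies ist ein satz in der deutschen sprache", profiles))
```

Since the profiles are built once and each document costs a single pass over its characters plus one comparison per language, this kind of approach should scale reasonably to a large document set, though that is an expectation rather than a measured result.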
-
07-09-2004, 12:43 PM #5 Jerry Wolf Guest
Re: Automatic language detection
Using the search term "language identification" instead of "detection"
will get you a lot, and combining it with "bigraph" will get you a much
more focused set of citations.
Tristan Miller <psychonaut@nothingisreal.com> wrote in message news:<1886733.TpXR0R6JqR@ID-187157.News.Individual.NET>...
> Greetings.
>
> In article <7f2f3b74.0407072339.c5d2700@posting.google.com>, Marco Barulli
> wrote:
> > I'm looking for tools/libraries/... to automatically detect the
> > language (e.g. English, French, Italian, German, ...) of a given text
> > document.
> >
> > This tool needs to be applied to a very large set of text documents,
> > hence efficiency is not a trivial part of the solution.
> >
> > Many thanks for your help or suggestions,
>
> No cites, but I have heard that simple statistical bigraph and trigraph
> models are very effective. If you have a limited number of languages
> which you can personally identify, then it shouldn't be difficult for you
> to produce the necessary data yourself.
>
> Try a Google search for "language detection" and "bigraph"; you should turn
> up some papers.
>
> Regards,
> Tristan
-
07-11-2004, 05:25 PM #6 Marco Barulli Guest
Re: Automatic language detection
Thanks to all for the very kind, prompt and effective help!
Marco
-
08-06-2004, 09:20 PM #7 Art Pollard Guest
Re: Automatic language detection
"Marco Barulli" <marco.barulli@gmail.com> wrote in message
news:7f2f3b74.0407072339.c5d2700@posting.google.com...
> I'm looking for tools/libraries/... to automatically detect the
> language (e.g. English, French, Italian, German, ...) of a given text
> document.
You may want to check out:
http://www.lextek.com/
http://www.languageidentifier.com/
Of course, since I work for Lextek, I'm a bit biased....
-Art
--
Lextek International
http://www.lextek.com/
Suppliers of High Performance Text Search and Retrieval Engines
-