ThamizhiLIP v0.11

ThamizhiLIP: Thamizhi Linguistic Information Processing Library

View the Project on GitHub sarves/thamizhilip

ThamizhiLIP is a Python library for processing Tamil text. It provides POS tagging, morphological analysis, and dependency parsing.

All these have been developed on top of various tools and resources, including Stanza and foma. I have used both rule-based and deep learning approaches to create models that are used in thamizhilip.

How to use this library

  1. Install thamizhilip using pip: pip install thamizhilip (or pip3 install thamizhilip). This installs all required dependencies, including stanza, which is used for POS tagging and dependency parsing. You need Python 3.6 or higher to install thamizhilip.
  2. Import thamizhilip into your Python environment or Python IDLE: from thamizhilip import tamil
  3. Download the required models: tamil.downloadModels() This downloads all the models and resources required for processing and stores them in your HOME directory.

You are done! You can use this to do POS and Morphological tagging, and Dependency Parsing.
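Before running the steps above, it can help to confirm that the interpreter is new enough and that the packages are visible. The following is a minimal sketch that uses only the Python standard library; the helper names python_is_supported and is_installed are hypothetical and are not part of thamizhilip.

```python
import importlib.util
import sys

MIN_PYTHON = (3, 6)  # minimum version stated in the installation notes


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter meets the stated minimum version."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def is_installed(package_name):
    """Return True if a top-level package can be found, without importing it."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    print("Python OK:", python_is_supported())
    for name in ("thamizhilip", "stanza"):
        print(name, "installed:", is_installed(name))
```

If either package is reported as missing, repeat step 1 with the pip that belongs to the same interpreter.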

POS tagging

There are several POS tagsets available for Tamil. ThamizhiLIP supports both the Universal POS (UPOS) tagset and the Amrita POS tagset, so you can tag your data with either one.

The following is a complete example of POS tagging.

from thamizhilip import tamil
tamil.downloadModels()

# Load a model; use either one of these:
# if you want UPOS tags
mypos_model = tamil.loadModels("pos")
# or
# if you want Amrita tags
mypos_model = tamil.loadModels("pos", "amrita")

# POS-tag the data; you can feed a word or a sentence
print(tamil.posTag("your Tamil data here", mypos_model))
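If the tagset is chosen at run time, the two loadModels calls shown above can be wrapped in a small helper. This is only a sketch: pos_model_args and tag_with are hypothetical names, and the only library calls used are tamil.loadModels and tamil.posTag with the arguments that appear in the example above.

```python
def pos_model_args(tagset="upos"):
    """Map a tagset name to the loadModels arguments used in the example above."""
    key = tagset.lower()
    if key == "upos":
        return ("pos",)
    if key == "amrita":
        return ("pos", "amrita")
    raise ValueError("unknown tagset: " + tagset)


def tag_with(tagset, text):
    """Load the POS model for the chosen tagset and tag the given text."""
    # Imported lazily so the mapping above can be used without thamizhilip.
    from thamizhilip import tamil
    model = tamil.loadModels(*pos_model_args(tagset))
    return tamil.posTag(text, model)
```

For example, tag_with("amrita", "your Tamil data here") would load the Amrita model and tag the text. Loading a model on every call is wasteful, so for repeated tagging keep the loaded model in a variable as in the example above.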

Morphological Analysis

There are several tagsets available for morphological annotation. ThamizhiLIP uses its own tagset and the Universal Features inventory (UFeat) of Universal Dependencies. The ThamizhiLIP tagset is more granular than UFeat. The following is a complete example of morphological tagging.

from thamizhilip import tamil
tamil.downloadModels()

#Morphological tagging, you need to feed a word at a time
#if you want to get the analysis using Thamizhilip tagset
print(tamil.morphTag("your Tamil word"))

#if you want to get the analysis using Universal Feature Set
print(tamil.morphTag("your Tamil word","ud"))
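Because morphTag takes one word at a time, a sentence has to be split before it is analysed. The sketch below shows one way to do that; tag_each_word is a hypothetical helper, the tagging function is passed in as a parameter, and splitting on whitespace is an assumption (punctuation attached to a word is not separated).

```python
def tag_each_word(sentence, tag_word):
    """Split a sentence on whitespace and apply tag_word to each word in order.

    tag_word is any function that takes a single word and returns its analysis.
    """
    return [(word, tag_word(word)) for word in sentence.split()]


if __name__ == "__main__":
    # Requires thamizhilip and its downloaded models.
    from thamizhilip import tamil

    # ThamizhiLIP tagset
    print(tag_each_word("your Tamil sentence", tamil.morphTag))
    # Universal Feature set
    print(tag_each_word("your Tamil sentence", lambda w: tamil.morphTag(w, "ud")))
```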

Dependency Parsing

ThamizhiLIP can parse a given sentence using the Universal Dependencies annotation scheme. The following is a complete example.

from thamizhilip import tamil
tamil.downloadModels()

# To use the dependency parser, you first need to load the models.
depModel = tamil.loadModels()

# Then pass the loaded models to depTag when parsing a sentence.
# You need to feed one sentence at a time.
print(tamil.depTag("கண்ணன் அந்தப் புத்தகத்தைப் செய்தான்", depModel))

# For instance,
# >>> print(tamil.depTag("கண்ணன் அந்தப் புத்தகத்தைப் செய்தான்", depModel))
# would give you the following output:
# கண்ணன் அந்தப் புத்தகத்தைப் செய்தான்
# 1|PROPN|nsubj|4
# 2|DET|det|3
# 3|NOUN|obj|4
# 4|VERB|root|0
# 5|PUNCT|punct|4

# As shown above, the output has 4 columns:
# 1st column: serial number of the token
# 2nd column: UPOS tag
# 3rd column: [dependency type](https://universaldependencies.org/u/dep/all.html)
# 4th column: serial number of the head token that this token depends on (0 for the root)
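The four-column output is easier to work with once it is turned into records. The following sketch assumes the output is available as text in exactly the format shown above; the actual return type of depTag is not documented here, so treat that as an assumption. DepToken, parse_dep_lines and find_root are hypothetical names.

```python
from typing import NamedTuple


class DepToken(NamedTuple):
    index: int   # 1st column: serial number of the token
    upos: str    # 2nd column: UPOS tag
    deprel: str  # 3rd column: dependency type
    head: int    # 4th column: serial number of the head (0 for the root)


def parse_dep_lines(text):
    """Parse lines of the form index|UPOS|deprel|head; skip every other line."""
    tokens = []
    for line in text.splitlines():
        parts = line.strip().split("|")
        if len(parts) != 4:
            continue  # e.g. the echoed sentence or a blank line
        if not (parts[0].isdecimal() and parts[3].isdecimal()):
            continue
        tokens.append(DepToken(int(parts[0]), parts[1], parts[2], int(parts[3])))
    return tokens


def find_root(tokens):
    """Return the token whose head is 0, or None if there is no such token."""
    for token in tokens:
        if token.head == 0:
            return token
    return None


# The sample output from the example above, without the comment markers.
SAMPLE = """கண்ணன் அந்தப் புத்தகத்தைப் செய்தான்
1|PROPN|nsubj|4
2|DET|det|3
3|NOUN|obj|4
4|VERB|root|0
5|PUNCT|punct|4
"""

if __name__ == "__main__":
    for token in parse_dep_lines(SAMPLE):
        print(token)
```

A named tuple keeps each row readable (token.upos, token.head) while still behaving like a plain tuple.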
 

Tamil Word validator

Apart from POS, morphological, and dependency tagging, you can use the following script to check whether a given word is a Tamil word. The script verifies the structure of a word against the word formation rules given in Nannool, a well-known Tamil grammar text. https://github.com/sarves/thamizhi-preprocessor/
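As a much simpler pre-check, and not a substitute for the Nannool-based validation in the script linked above, you can test whether a word is written entirely in Tamil script. The sketch below relies only on the Unicode Tamil block (U+0B80 to U+0BFF); is_tamil_script is a hypothetical helper and is not part of either repository.

```python
TAMIL_BLOCK_START = 0x0B80  # first code point of the Unicode Tamil block
TAMIL_BLOCK_END = 0x0BFF    # last code point of the Unicode Tamil block


def is_tamil_script(word):
    """Return True if the word is non-empty and every character is in the Tamil block.

    This checks the script only; it says nothing about whether the word
    follows Tamil word formation rules.
    """
    return bool(word) and all(
        TAMIL_BLOCK_START <= ord(ch) <= TAMIL_BLOCK_END for ch in word
    )


if __name__ == "__main__":
    for candidate in ("கண்ணன்", "hello", "கண்ணன்1"):
        print(candidate, is_tamil_script(candidate))
```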

Cite

If you use this tool, please cite us:

Future work

A lot remains to be done; this is just the beginning. You can expect the following improvements in the very near future:

For more information

You can find more information about these tools via the following sites:

Acknowledgment

I would like to express my appreciation to my supervisors, Prof. Gihan Dias and Prof. Miriam Butt, for all their guidance. Further, I am thankful to the National Language Processing Centre, University of Moratuwa, Sri Lanka, for providing all the facilities to build the current version of ThamizhiLIP.

In addition, I would like to mention that this research was supported by the Accelerating Higher Education Expansion and Development (AHEAD) Operation of the Ministry of Higher Education, Sri Lanka, funded by the World Bank, and by the DAAD (German Academic Exchange Service).

Contact and support

Do you have any problems? Just reach out to me! - @sarves. Feel free to fork and improve ThamizhiLIP, and send me your feedback so that I can improve this library. ThamizhiLIP is released under Apache 2.0.