Students: Vrushti Patel

Project: Revitalization of Endangered Languages using Machine Learning

Major: Computer Science and Business

Advisor: Maryam Rahnemoonfar

Abstract

Currently, about 7,000 languages are spoken in the world. Of these, around 3,000 are endangered, most of which are indigenous languages¹. Our research focuses specifically on the Cherokee language, which has only about 2,000 fluent first-language speakers. Such extensive language loss is alarming because languages are heavily tied to culture and hold decades' worth of knowledge. The loss of a language would mean the loss of the culture and history tied to it.

However, low-resource languages such as Cherokee tend to have less linguistic data available, making it difficult to work with them using traditional language models. In fact, the currently available open-source Cherokee-to-English parallel dataset has only about 16,000 sentences. Despite the lack of a large dataset, training a Cherokee-to-English translation model with OpenNMT-py yielded a BLEU score of 35, indicating potential for an effective and accurate machine translation model.
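
For readers unfamiliar with BLEU, the sketch below illustrates how such a score can be computed: clipped n-gram precisions for orders 1 through 4 are combined by a geometric mean and multiplied by a brevity penalty. It is only an illustration under simplifying assumptions (whitespace tokenization, one reference per sentence, no smoothing) and is not the scoring tool used in this project. The function names and the English toy sentences are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis."""
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # n-grams produced by the system, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tok, n)
            ref_counts = ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize system output that is shorter than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Toy English sentences, used only to exercise the function.
exact = corpus_bleu(["the cat sat on the mat"], ["the cat sat on the mat"])
partial = corpus_bleu(["the cat sat on the mat today"], ["the cat sat on the mat"])
```

Scores from a sketch like this are not directly comparable to the reported 35, since established tools apply their own tokenization and smoothing.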

Our research also explores avenues for enlarging the existing dataset. The team has employed optical character recognition (OCR) to retrieve parallel data from Cherokee children’s books and other literature. Another avenue is back translation: translating Cherokee monolingual data into English to produce synthetic Cherokee-to-English parallel data. Natural language processing (NLP) is a very promising tool for preserving and revitalizing endangered languages despite the challenges posed by the lack of data. Methods like machine translation and OCR can help preserve languages and their literature, and allow communities to protect valuable knowledge, history, and culture.
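
The back-translation idea described above can be sketched in a few lines: each monolingual sentence is passed through an existing translation model, and the sentence is paired with the model's output to form a synthetic parallel example. This is a minimal sketch, not the team's pipeline. The names `build_synthetic_pairs` and `toy_translate` are hypothetical, and `toy_translate` is a placeholder standing in for a trained Cherokee-to-English model.

```python
from typing import Callable, Iterable, List, Tuple


def build_synthetic_pairs(
    monolingual: Iterable[str],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each monolingual sentence with its machine translation."""
    pairs = []
    for sentence in monolingual:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank lines in the monolingual corpus
        translation = translate(sentence).strip()
        if translation:  # drop sentences for which the model returned nothing
            pairs.append((sentence, translation))
    return pairs


def toy_translate(sentence: str) -> str:
    """Hypothetical placeholder for a trained translation model."""
    return sentence.upper()


# Placeholder strings stand in for real monolingual sentences.
pairs = build_synthetic_pairs(["placeholder one", "", "placeholder two"], toy_translate)
```

In practice, synthetic pairs are usually mixed with the human-translated data, and some form of quality filtering is often applied, because errors in the machine-translated side propagate into the enlarged dataset.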

About Vrushti Patel

Vrushti Patel is a second-year student studying Computer Science and Business, minoring in Data Science and Global Studies. Through the Clare Boothe Luce Research Scholarship, Patel had the opportunity to conduct research under Professor Rahnemoonfar at the Bina Lab. Patel's research team focuses on revitalizing endangered languages using natural language processing. Apart from Computer Science and Data Science, she is very passionate about global citizenship. At Lehigh, Patel is part of the Global Citizenship Program, where students from different backgrounds come together to discuss global themes. She is also a Global Social Impact Fellow for the AISHA team. AISHA aims to bridge the gap in medical knowledge through the conversational capabilities of Amazon Alexa. Lastly, Patel is a Career Intern and TRAC Fellow who loves meeting other Lehigh students and helping them professionally and academically.