Voice analysis with Machine Learning (AI)

Education Technology (Ed Tech) - using deep learning to analyse pronunciation 

“Pronunciation is definitely the biggest thing that people notice when you are speaking English.”

Tomasz P. Szynalski


Machine learning and natural language processing (NLP) company accenteasy shared its expertise with Vietnam's education ministers to support Project 2020's objective of improving English language skills in the country. Government ministers of education for Singapore, Malaysia and Vietnam want their citizens to be able to speak English with "the same accent to enable clear and effective communication."

Those who speak English in these countries generally have good grammar, but speaking English with regional phonetic variations often makes it difficult for speakers to understand each other: spoken proficiency, rather than grammar, was hindering clear communication. Habit and familiarity encourage speakers to use phonemes from their native language when speaking English, rather than the less familiar phonemes of British Received Pronunciation (RP) or General American (GenAm).

An application was created to harvest words spoken with a British (RP) accent, using deep learning to classify features in the digital signals of these words. Machine learning was then used to develop the final version, which can identify when speakers veer away from the RP accent.
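
As a rough illustration of flagging speech that drifts from a reference accent, the sketch below compares a speaker's per-phoneme feature vectors with reference vectors and reports the phonemes whose distance exceeds a threshold. Everything in it is an assumption made for illustration: the source does not describe the features, the model or the thresholds actually used, and a simple Euclidean distance stands in here for the deep learning classifier. The phoneme labels and feature values are made-up numbers, not measurements.

```python
import math

# Hypothetical reference features per phoneme (two made-up values each).
# A real system would learn these from recordings of RP speakers.
RP_REFERENCE = {
    "i:": (0.30, 2.30),
    "ae": (0.75, 1.70),
    "th": (0.40, 1.10),
}


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flag_deviations(spoken, reference=RP_REFERENCE, threshold=0.25):
    """Return the phonemes whose features lie further than `threshold`
    from the reference, sorted from most to least deviant."""
    scores = {
        phoneme: distance(features, reference[phoneme])
        for phoneme, features in spoken.items()
        if phoneme in reference  # ignore phonemes with no reference entry
    }
    flagged = [p for p, d in scores.items() if d > threshold]
    return sorted(flagged, key=lambda p: scores[p], reverse=True)


# Made-up speaker: close to the reference on "i:" and "ae", far off on "th".
speaker = {
    "i:": (0.32, 2.25),
    "ae": (0.70, 1.65),
    "th": (0.40, 1.60),
}
print(flag_deviations(speaker))
```

In this toy setup the threshold decides how much variation counts as "veering away"; in practice it would need tuning per phoneme and per learner group.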

The application is hosted centrally in Singapore on Amazon Web Services (AWS) servers, using Hadoop (MapReduce) to recruit servers as and when required, thus minimising costs. This system is exposed as an API for:
  • Android and iPhone apps
    • analyser that assesses and identifies parts of words the user has difficulty with
    • country specific lessons (designed by PhD linguists) 
  • Ed Tech
    • gamification for phoneme learning
    • interactive arcade game using voice analysis to open doors and solve puzzles
  • Teacher console
    • designing lessons
    • connects apps into classes of students 
    • automatically marks and tracks homework
  • API
    • JSON interface for extracting demographics from voice
      • education, past travel
      • physical characteristics: gender, height
      • emotion
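
To make the JSON interface more concrete, the sketch below parses one possible response into a typed record. The field names (`education`, `past_travel`, `physical`, `gender`, `height_cm`, `emotion`) and the sample payload are hypothetical, chosen only to mirror the categories listed above; the source does not document the actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class VoiceDemographics:
    """Hypothetical shape of one demographics result; not the real schema."""
    education: Optional[str]
    past_travel: list
    gender: Optional[str]
    height_cm: Optional[float]
    emotion: Optional[str]


def parse_demographics(payload: str) -> VoiceDemographics:
    """Parse a JSON string into a record, tolerating missing fields."""
    data = json.loads(payload)
    physical = data.get("physical", {})
    return VoiceDemographics(
        education=data.get("education"),
        past_travel=data.get("past_travel", []),
        gender=physical.get("gender"),
        height_cm=physical.get("height_cm"),
        emotion=data.get("emotion"),
    )


# Made-up sample payload, for illustration only.
sample = """
{
  "education": "tertiary",
  "past_travel": ["UK"],
  "physical": {"gender": "female", "height_cm": 165},
  "emotion": "neutral"
}
"""
result = parse_demographics(sample)
print(result.gender, result.height_cm, result.emotion)
```

Treating every field as optional reflects the assumption that a voice sample may not support a confident estimate for each attribute.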