The Chars74K dataset

by Teofilo E. de Campos, Bodla Rakesh Babu, Manik Varma

Character recognition is a classic pattern recognition problem on which researchers have worked since the early days of computer vision. With today's omnipresence of cameras, the applications of automatic character recognition are broader than ever. For Latin script, this is largely considered a solved problem in constrained situations, such as images of scanned documents containing common character fonts and a uniform background. However, images obtained with popular cameras and handheld devices still pose a formidable challenge for character recognition. The challenging aspects of this problem are evident in this dataset.

In this dataset, symbols used in both English and Kannada are available. In the English language, Latin script (excluding accents) and Hindu-Arabic numerals are used. For simplicity we call this the "English" character set. Our dataset consists of:

- 62 classes (0-9, A-Z, a-z)
- 7705 characters obtained from natural images
- 3410 hand-drawn characters using a tablet PC
- 62992 synthesised characters from computer fonts

This gives a total of over 74K images (which explains the name of the dataset).

The compound symbols of Kannada were treated as individual classes, meaning that a combination of a consonant and a vowel leads to a third class in our dataset. Clearly this is not the ideal representation for this type of script, as it leads to a very large number of classes. However, we decided to use this representation for the baseline evaluations presented in [deCampos et al] as a way to evaluate a generic recognition method for this problem.
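As a quick sanity check on the figures quoted for the English character set, the following sketch enumerates the label range 0-9, A-Z, a-z and sums the three subset sizes. The variable names and dictionary keys are illustrative assumptions and are not part of any official dataset tooling. Enumerating the range gives 62 distinct labels, and the subsets sum to 74,107 images, consistent with "over 74K".

```python
import string

# Label set as described in the text: digits, then uppercase and
# lowercase Latin letters (no accented characters).
classes = string.digits + string.ascii_uppercase + string.ascii_lowercase
num_classes = len(classes)

# Subset sizes quoted in the dataset description.
# The keys are hypothetical names chosen for this example.
subset_sizes = {
    "natural_images": 7705,
    "hand_drawn": 3410,
    "synthesised_fonts": 62992,
}
total_images = sum(subset_sizes.values())

print(num_classes)   # 62
print(total_images)  # 74107
```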

Dataset Attributes

Tasks: Classification
Categories: Word, Language, Characters, Writing
Sensor: RGB Camera