Keywords

DRAFT

Recently while browsing through 'answer' list on quora.com, I stumbled upon a question:

About how many keywords do most programming languages contain?

While at first the question seemed uninteresting and dry, after giving it a thought, the idea of finding number of keywords in languages seemed fun to me. And so began my mini adventure of determining arithmetic mean of number of keywords in programming languages.

Note: I have described the whole process of how I calculated the mean here, in case you are interested just in glancing over the result, head over to the end of page.

Word Count

The first thought that came to my mind was to get a file containing all the keywords in a language and pass it to wc -w to get the word count. And to procure such files, all I needed was to get my hands on the syntax highlighting scripts for any one of the text editors. After exploring mode files for Emacs¹ and Vim scripts², it was clear that I was better off searching for scripts for an editor which doesn't place much logic into its syntax files or which were easily parsable. Fortunately I found a source of such files easily on the web – IDMComputerSolutions/wordfiles

Parsing Wordfiles

The wordfiles from UltraEdit project are of the format:

/C<d - 1>"other stuff"
...
/C<d>"keyword"
Keyword1
Keyword2
...
/C<d + 1>"other stuff"
...

I was interested in capturing only the section starting with "keyword", "reserved word" or "command". sed appeared to be the perfect tool for this job, it just had to pipe the lines to wc starting with such phrases until the next section. Here are the sed commands used:

sed -n -e "/[Kk][Ee][Yy][Ww][Oo][Rr][Dd]/, /\/C[0-9]/p" \
       -e "/[Rr][Ee][Ss][Ee][Rr][Vv][Ee][Dd]/, /\/C[0-9]/p" \
       -e "/[Cc][Oo][Mm][Mm][Aa][Nn][Dd]/, /\/C[0-9]/p"
# unfortunately sed on OS X does not recognize Ignorecase flag, hence the ugly [Cc]
# workaround

Unfortunately, there was one small problem, using sed I was unable to do a non greedy match for the ending pattern. I wrote a small Python script capture.py to do the job and a awk one liner to calculate the arithmetic mean.

One last thing

The complete repository of wordfiles contains over 600 files which contains several esoteric languages/DSL which have abnormally high number of reserved keywords and some which contains no reserved words. Some of the languages are obsolete and some are of very little interest to general public. Including every language in calculations severely disturbed the mean, taking it to a very high number. Hence I picked around 60 well known languages and calculated the mean.

#!/bin/bash

interestingFiles=('JSON' 'acl' 'actionscript' 'ada95' 'c' 'c-winapi'
                  'c-os2' 'csharp' 'e' 'eiffel' 'escript' 'euphoria' 'falcon'
                  'fortran' 'fortran90' 'fortran95' 'go' 'groovy' 'hugo'
                  'java122' 'java13' 'java14' 'java14jsp' 'javascript' 'js20'
                  'jscript' 'jsp' 'lua' 'lull' 'maple' 'modula' 'modula2' 'oracle11g'
                  'pascal' 'perl' 'php5' 'pl1' 'plsql' 'python26' 'python35'
                  'qbasic' 'ruby' 'rust' 'sap' 'schema' 'scheme' 'small' 'sql'
                  'swift' 'symbian' 'tcl-tk' 'turboc' 'uniscript' 'vb' 'vbdotnet'
                  'vbscript' 'verilog2001' 'vhdl93' 'yaml' 'zillions')

#for filename in ./wordfiles/*.uew; do
for filename in ${interestingFiles[@]}; do
    echo -n "$filename"
    ./capture.py ./wordfiles/"$filename".uew | wc -w
done

Result

Piping everything together, the average number of keywords in 60 languages came out to be:

66.5333

Charts

View the graph in bigger size here.

Caveats

As it is evident from the results, the number of words to be highlighted do not accurately reflect the actual number of keywords. For example, keywords are vaguely defined in scheme and depend on the implementation. One can redefine almost every keyword which is defined during initialization. Thus please consider the result as an outcome of a, fun filled, hack done in couple of hours, rather than a research experiment implemented with rigorous and diligent efforts.

Footnotes:

in case you are curious to see one, they are located under progmodes directory inside Emacs folder, which would be /Applications/Emacs.app/Contents/Resources/lisp/progmodes if you are on a Mac.

located under /Applications/MacVim.app/Contents/Resources/vim/runtime/syntax