Keywords
Recently, while browsing through the answers list on quora.com, I stumbled upon a question:
About how many keywords do most programming languages contain?
While the question seemed uninteresting and dry at first, after giving it some thought, the idea of finding the number of keywords in languages seemed fun to me. And so began my mini adventure of determining the arithmetic mean of the number of keywords in programming languages.
Note: I have described the whole process of how I calculated the mean here. If you are only interested in the result, head over to the end of the page.
Word Count
The first thought that came to my mind was to get a file containing all the keywords in a language and pass it to wc -w to get the word count. To procure such files, all I needed was to get my hands on the syntax highlighting scripts for any one of the text editors. After exploring mode files for Emacs[1] and Vim scripts[2], it was clear that I was better off searching for scripts for an editor that doesn't place much logic in its syntax files, or whose files are easily parsable. Fortunately, I easily found a source of such files on the web: IDMComputerSolutions/wordfiles
Parsing Wordfiles
The wordfiles from the UltraEdit project have the following format:
/C<d - 1>"other stuff" ... /C<d>"keyword" Keyword1 Keyword2 ... /C<d + 1>"other stuff" ...
I was interested in capturing only the sections starting with "keyword", "reserved word" or "command". sed appeared to be the perfect tool for this job: it just had to pipe the lines from such a phrase up to the next section to wc. Here are the sed commands used:
sed -n -e "/[Kk][Ee][Yy][Ww][Oo][Rr][Dd]/, /\/C[0-9]/p" \
       -e "/[Rr][Ee][Ss][Ee][Rr][Vv][Ee][Dd]/, /\/C[0-9]/p" \
       -e "/[Cc][Oo][Mm][Mm][Aa][Nn][Dd]/, /\/C[0-9]/p"
# unfortunately sed on OS X does not recognize Ignorecase flag, hence the ugly [Cc]
# workaround
Unfortunately, there was one small problem: using sed, I was unable to do a non-greedy match for the ending pattern. I wrote a small Python script, capture.py, to do the job, and an awk one-liner to calculate the arithmetic mean.
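capture.py itself is not reproduced here. What follows is only a minimal sketch of how such a script might work, not the original. It assumes that every section header has the shape /C<number>"name" shown above and that a section's words run up to the next header. The names SECTION_HEADER, WANTED and capture_keywords are made up for illustration, and the sample string is not taken from a real wordfile.

```python
import re

# Assumed header shape: /C<number>"section name" (from the format described above).
SECTION_HEADER = re.compile(r'/C(\d+)"([^"]*)"')
# Same three phrases as the sed commands, matched case-insensitively.
WANTED = re.compile(r'keyword|reserved|command', re.IGNORECASE)


def capture_keywords(text):
    """Return the words of every section whose name matches WANTED."""
    words = []
    headers = list(SECTION_HEADER.finditer(text))
    for i, header in enumerate(headers):
        if not WANTED.search(header.group(2)):
            continue
        # The section body stops right before the next header, or at end of text.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        words.extend(text[header.end():end].split())
    return words


# Illustrative input only.
sample = '/C1"Keywords" if else while /C2"Operators" + - /C3"Reserved Words" goto const'
print(capture_keywords(sample))
print(len(capture_keywords(sample)))
```

Because each section body is cut at the start of the next header, the next section's header line never leaks into the count, which is the part that was awkward to express with a sed range.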
One last thing
The complete repository of wordfiles contains over 600 files, including several esoteric languages and DSLs that have an abnormally high number of reserved keywords, and some that contain no reserved words at all. Some of the languages are obsolete and some are of very little interest to the general public. Including every language in the calculation severely distorted the mean, pushing it to a very high number. Hence I picked around 60 well-known languages and calculated the mean.
#!/bin/bash
interestingFiles=('JSON' 'acl' 'actionscript' 'ada95' 'c' 'c-winapi' 'c-os2'
    'csharp' 'e' 'eiffel' 'escript' 'euphoria' 'falcon' 'fortran' 'fortran90'
    'fortran95' 'go' 'groovy' 'hugo' 'java122' 'java13' 'java14' 'java14jsp'
    'javascript' 'js20' 'jscript' 'jsp' 'lua' 'lull' 'maple' 'modula' 'modula2'
    'oracle11g' 'pascal' 'perl' 'php5' 'pl1' 'plsql' 'python26' 'python35'
    'qbasic' 'ruby' 'rust' 'sap' 'schema' 'scheme' 'small' 'sql' 'swift'
    'symbian' 'tcl-tk' 'turboc' 'uniscript' 'vb' 'vbdotnet' 'vbscript'
    'verilog2001' 'vhdl93' 'yaml' 'zillions')

#for filename in ./wordfiles/*.uew; do
for filename in "${interestingFiles[@]}"; do
    echo -n "$filename"
    ./capture.py ./wordfiles/"$filename".uew | wc -w
done
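The averaging step is plain arithmetic: sum the per-language counts and divide by the number of languages. A rough Python equivalent of an awk one-liner for this might look like the sketch below. It assumes each output line ends with the count, separated from the filename by whitespace (as happens when wc pads its output). The function name mean_keyword_count is hypothetical, and the sample lines are invented for illustration, not real measurements.

```python
def mean_keyword_count(lines):
    """Average the trailing integer on each non-empty 'name count' line."""
    counts = [int(line.split()[-1]) for line in lines if line.strip()]
    return sum(counts) / len(counts)


# Invented sample lines, only to show the arithmetic: (40 + 20 + 30) / 3.
sample_output = ["c       40", "lua     20", "go      30"]
print(mean_keyword_count(sample_output))
```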
Result
Piping everything together, the average number of keywords in 60 languages came out to be:
66.5333
Charts
View a larger version of the graph here.
Caveats
As is evident from the results, the number of words to be highlighted does not accurately reflect the actual number of keywords. For example, keywords are vaguely defined in scheme and depend on the implementation. One can redefine almost every keyword that is defined during initialization. So please consider the result the outcome of a fun-filled hack done in a couple of hours, rather than a research experiment carried out with rigor and diligence.