Thursday, November 28, 2013

On Excel Interoperation with F#

Digging into the Excel interoperation facilities of F#, example code like F# 3.0 Sample Pack, and projects like Excel F# Type Provider and Excel-DNA are first entry points for the topic. I was interested to learn, how these tools would be applied to the following problem: Feed data from an Excel sheet into a Deedle data frame in order to calculate a statistics on the data. 

From Excel Sheets to Deedle Data Frame

Deedle provides a CSV Type Provider out of the box, but I do not take a solution based on file format transformation into account for Excel files on a Windows platform. It seemed straight forward to me to enhance the example code of the Sample Pack and FSSnip to get this task done. Two sub-problems turned out to raise concerns:
  1. transfer the data from the Excel object into a F# data structure that could be coerced to a Deedle data frame,
  2. manage the life cycle of the Excel object properly while avoiding boilerplate code as much as possible.
I finally came up with a solution that abstracts the notion of an Excel workbook as a class which manages the Excel object life cycle by implementing the IDisposable interface and allows access to arbitrary sheets. The member functions allow access the Excel sheet data by indices (bound to 1) which relate to Excel range selectors in the "Row:Column" string format. The class constructor expects to receive a vaild path/file name combination, where the path may be ommited (use 'None' as parameter) and relative paths work fine as part of the file name.

The member function are implemented via external helper function which I chose not to include in the class. The first version 'getFrameWithStringHeader', which is a straightforward adaption of the FSSnip code example, is disapointingly slow because of the per-cell access mode. An improved version 'getFrameWithHeader' graps the Excel range in one piece. The complete source code is published on Github. The code would need further cleanup, but I think it reveals the interesting points properly.

I purposely coerced the Excel data in a semi-generic Frame of type 'int * 'a', where the column selector type is derived from the Excel input, but the row selector type is fixed. I do not think that this is a major restriction as, for example, time series data will most likely keep the time information in a separate series instead of using the time points as row indices.

The Excel workbook class handles errors in terms of returning 'None'. Exceptions in dealing with Excel and the Excel file are purposely captured. I think, I found the same idea on Scott Wlaschin's blog FSharpForFunAndProfit, which is a good read anyway. The Excel resource management works well as far as I can tell.

Porting Functional Code from Clojure to F#

It is a nice experience that porting code functional code from Clojure to F# does not raise specific pain. It is also interesting to learn by example where the difference between dynamically typed and strong typed strikes. I really appreciate the tool support by the compiler and debugger as a reward for the enforced type discipline which reduces the hammock and REPL time quite bit for me. You may want to look yourself at the code comparison here.