A week ago, a friend of mine asked me if I could help him with parsing his old Excel files containing orders of Christmas trees. It was his seasonal business every end of a year for a long time, and old orders back to 2006 were kept in quite a free form so that I myself as a human had difficulties to grasp the structure of the worksheets. However, the goal was to export the data into a CRM system, to identify clients who made regular orders and to call them and try selling trees this year as well.

I have been working on a desktop app that will integrate my Spreads library, Excel, R and .NET coding into a single real-time universe of data, called Data Spreads. This work is still in progress, but the Christmas trees story forced me to pack at least the scripting feature I had and distribute it to my friend. So that instead of repacking an XLL add-in every time on my side, I could send him a C# script as a text file and interactively (e.g. via a remote desktop) adjust its behavior if needed.

The scripting feature was “nice to have” initially for Data Spreads. I started playing with the CShell, but it was rather overloaded from GUI and internal modular architecture perspectives, yet the editing experience was not very great compared to what Visual Studio or VS Code offers. Then I stumbled upon RoslynPad, and it surprised me with its elegant simplicity and very pleasant editing experience powered by Roslyn. At some point, I found myself keeping it open for writing small snippets and using it instead of the heavyweight VS for quick data parsing and similar tasks. Integrating RoslynPad with Excel was quite easy with another great open-source library ExcelDna. All I had to do was to rewrite the RP hosting functionality to run in the same process and AppDomain where Excel add-in runs, and hook up some events to register UDFs and macros marked with the relevant ExcelDna attributes. Then it just worked™☺

Christmas trees order parsing

Humans were entering the orders in plain Excel, without any common format of addresses or phone numbers even on the same worksheet. Therefore, the task of correctly extracting the addresses and names looked like a machine learning task. However, the business task was extracting the phone numbers, connecting related orders, and making an aggregate statistics by items and success status. In addition, it turned out that humans from a call center would be calling to the clients, and in every particular case they could easily read the address and confirm with the client if it is still valid. The entire parsing task reduced to extracting phone numbers, order dates and success statuses - the three things that couldn’t be done easily with Excel formulas and required some coding.

Orders sample 2006

Phone numbers

Moscow had optional phone prefix until recently, so the seven-digit numbers are +7 495 XXX XX XX now. The long-distance prefix 8 is optional, and by international standard should be +7. The usage of dashes, dots, parentheses and spaces was non-standard inside each sheet. To deal with this, I wrote a simple Sloppy Digit Pattern RegEx:

// SloppyDigitPattern: Since human entered text in a free form, we assume spaces between digits in any place (fat fingers)
// and very optional use of punctuation, e.g. '-', '.'    
public static string sdp = @"([\+-.\(]?\s*\d\s*[\)]?)";
public static Regex phoneRegex = new Regex($@"({sdp}{{7,11}})");

Then a simple function returns all found phone numbers in the format that the CRM system supports:

public static partial class Parser
{
    public static string Phones(string value)
    {
        return phoneRegex.Matches(value)
            .Cast<Match>().Select(m => m.Value)
            .Select(str => new String(str.Where(ch => Char.IsDigit(ch)).ToArray())) // keep only digits
            .Select(number =>
                {
                    if (number.Length == 11)
                    {
                        number = "+7" + number.Substring(1);
                    }
                    else if (number.Length == 10)
                    {
                        number = "+7" + number;
                    }
                    else if (number.Length == 7)
                    {
                        number = "+7495" + number;
                    }
                    else
                    {
                        return ""; // we only support 7, 10, and full 11 digit numbers
                    }
        
                    return number;
                })
            .Where(x => !string.IsNullOrWhiteSpace(x))
            .Select(number => number.Insert(2, " ").Insert(6, " ").Insert(10, " ").Insert(13, " ")) // comment to remove spaces
            .Aggregate("", (acc, st) => acc + "[ " + st + " ] ").Trim();

    }
}

Finally, we wrap the static method in another method that is marked with ExcelDna’s ExcelFunction attribute.

const bool defaultVolatile = true;

[ExcelFunction(IsVolatile = defaultVolatile)]
public static object ParsePhone(string input = "")
{
    return string.IsNullOrWhiteSpace(input) ? "" : Parser.Phones(input);
}

Now after running the script (F5) we have the function ParsePhone available from Excel or could debug it directly from the script editor. The editor pretty-prints the last variable in a script (placed without semicolon), or any variable using the Dump() extension method, e.g. matches.Dump();. Interestingly, the Dump method works when a function is called from Excel and prints variables in the same way, which is nice for debugging.

Debug dump

Delivery date

Delivery dates were put as a header, once per day and not per order, e.g. на 08.12.07г. , суббота on the picture. Instead of writing C# code that will do something like “for every row where the address and quantity are not empty, move above until the quantity is empty, parse the cell in the same column as addresses, and for those orders apply that parsed date”, I wrote that logic using Excel functions. The task became very similar to phone parsing:

public static Regex deliveryDateRegex = 
    new Regex(@"(?<date>\d{1,4})(\s*[.]?\s*)(?<month>(\d{1,4}|\p{L}+))(\s*[.]?\s*)?(?<year>\d{1,4})?");

public static DateTime? DeliveryDate(string value, int year = 0)
{
    try
    {
        var match = deliveryDateRegex.Matches(value).Cast<Match>().First();
        var date = int.Parse(match.Groups["date"].Value);
        //date.Dump("date");
        var monthString = match.Groups["month"].Value;
        int month = 0;
        if (!int.TryParse(monthString, out month))
        {
            if (monthString.ToLowerInvariant().Contains("дек"))
            {
                month = 12;
            }
            if (monthString.ToLowerInvariant().Contains("ноя"))
            {
                month = 11;
            }
            if (monthString.ToLowerInvariant().Contains("окт"))
            {
                month = 10;
            }
        }
        //month.Dump("month");
        if (month == 0) return null;
        if (year == 0) return null;
        //year.Dump("year");
        var dt = new DateTime(year, month, date);
        return dt;
    }
    catch (Exception)
    {
        return null;
    }
}

[ExcelFunction(IsVolatile = defaultVolatile)]
public static object ParseDeliveryDate(string input, int year = 0)
{
    var dt = Parser.DeliveryDate(input, year);
    return dt == null ? (object)"" : dt.Value.ToOADate();
}

Order status

Order status is determined by background or font color. The simple rule is that when any of those colors have R value greater that G value in the RGB representation, the order status is unsuccessful.

#r "System.Drawing"

using ExcelDna.Integration;
using Microsoft.Office.Interop.Excel;

public static Range ReferenceToRange(ExcelReference xlRef)
{
    var xlRange = (ExcelDnaUtil.Application as Application).Evaluate(XlCall.Excel(XlCall.xlfReftext, xlRef, true)) as Range;
    return xlRange;
}

[ExcelFunction(IsVolatile = defaultVolatile, IsMacroType = true)]
public static async Task<int> Status([ExcelArgument(AllowReference = true)]object input)
{
    var xlRef = input as ExcelReference;
    var rng = ReferenceToRange(xlRef);
    try
    {
        var oleColor = (int)(double)rng.Interior.Color;
        //oleColor.Dump();
        var color = System.Drawing.ColorTranslator.FromOle(oleColor);
        if (color.R > color.G) return 0;
        oleColor = (int)(double)rng.Font.Color;
        color = System.Drawing.ColorTranslator.FromOle(oleColor);
        return (color.R > color.G) ? 0 : 1;
    }
    catch (Exception)
    {
        return -1;
    }
    finally
    {
        Marshal.ReleaseComObject(rng);
    }
}

Here we use COM interop and Microsoft.Office.Interop.Excel namespace to get the COM Range object from ExcelDna’s ExcelReference object, and then convert OLE colors to RGB ones using the System.Drawing.ColorTranslator.FromOle method. Note the usage of Marshal.ReleaseComObject for COM objects here: it is a good practice to use this method whenever user code received or created a COM object and no longer needs it (more on this in the next section). Status formula

Excel Model and XlResult

While quickly writing the parsing code above, I made a very common mistake that I was aware of, but still assigned a COM object to a static field and did not release a reference to it. That issue is described in details in this SO question. The simple rule of thumb is to never ever call C or COM Excel API outside of the main thread, never use 2 dots with com objects, do not store COM objects for longer than they are needed and always release them with Marshal.ReleaseComObject.

ExcelDna has a helper method to schedule any action on the main thread, but it doesn’t return a value. Also, if one uses the method when already on the main thread and tries to wait for a result using a wait handle or a TaskComletionSource, there will be a deadlock. Always following the rules in a safe way requires repeating the same boilerplate code, and I wrote a helper class that works similarly to Task, but always executes on the main Excel thread, is awaitable from C# async methods using the await keyword, supports composition of functions and adds very little overheads. The code is quite long to place it here inline and is available on GitHub. The main work happens in the GetResult method.

We then could rewrite the Status example above using XlResult and await keyword. C# could await anything that has GetAwaiter method, and we just use TaskComletionSource.Task for this.


public static XlResult<Range> ReferenceToRange(ExcelReference xlRef)
{
    var xlRange = new XlResult<Range>(() => (ExcelDnaUtil.Application as Application).Evaluate(XlCall.Excel(XlCall.xlfReftext, xlRef, true)) as Range);
    return xlRange;
}

[ExcelFunction(IsVolatile = defaultVolatile, IsMacroType = true)]
public static async Task<int> Status([ExcelArgument(AllowReference = true)]object input)
{
    var xlRef = input as ExcelReference;
    // we could await XlResult from any thread safely
    var rng = await ReferenceToRange(xlRef);
    // same code for Status...
}

The COM issue could quickly become nasty. At some point, I even wrote an F# computation expression that did manual reference counting behind the scenes, but it didn’t work reliably 100% of times. Therefore, instead of reference counting, I added a Map method that could chain different functions together and guarantee that COM objects and C/COM APIs are never accessed outside the main thread. E.g. in this example, we chain three functions:

[ExcelFunction(IsMacroType = true)]
public static async Task<string> Hello([ExcelArgument(AllowReference = true)]object text) {
    var xlRef = text as ExcelReference;
    if (xlRef != null) {
        using (var result = XlResult.XlCall(XlCall.xlfReftext, xlRef, true)
            .Map(str => DSAddIn.XlApp.Evaluate((string)str) as Range)
            .Map(rng => rng.Value2.ToString())) {
            return await result;
        }
    }
    throw new Exception();
}

Accessing COM object and C API only from the main thread should work “as expected” even without explicit releasing via Marshal methods, as was written somewhere on ExcelDna forum/mailing list (cannot find a link). Using XlResult helps to stay on the main thread and follow the rules.

The ExcelModel library also contains some code from GitHub, where the author tried to achieve the similar result - to avoid using COM objects while working with a similar object model. However, it is not thread-safe now and I will gradually rewrite frequently used methods with the XlResult async pattern.

Install and usage

To install the Data Spreads app, please download Setup.exe or Setup.msi files*, which are packed with Squirrel and will automatically update when I add new functionality. When the Setup process finishes, you will have a tray icon with three commands: Excel Add-In, which starts Excel and loads the add-in, Script Editor, which loads a standalone editor, and Exit.

Tray icon

In Excel, you will have Data Spreads tab with a Script Editor button. You may find a basic self-explanatory sample and the Christmas trees sample in the Samples\Data Spreads\Excel folder. An Excel file that actually uses the functions from the Christmas sample is here.

Ribbon

If you get some error related to Excel version or bitness, you may find the .xll files in the C:\Users\%USERNAME%\AppData\Local\DataSpreads\app-X.Y.Z\Excel directory and start the one that corresponds to your Excel bitness (if you do not know it, just try both DataSpreads[32/64].xll until it works).

Below are the simplest user-defined functions that return the same $"Hello, {name}!" text. The difference is that the first one will block Excel UI during its execution, while the second one will return #NA until the async result is available but will keep the UI responsive. The async version is the best option for long-running functions with heavy calculations or IO operations, and could be easily used with awaitable XlResult:

[ExcelFunction(IsVolatile = false, IsThreadSafe = true, IsClusterSafe = true, Name = "Hello")]
public static object Hello(
[ExcelArgument(Name = "NAME", AllowReference = true, Description = "Name for greetings")] string name)
{
    if (string.IsNullOrWhiteSpace(name)) name = "Data Spreads";
    // this will block Excel UI
    Thread.Sleep(1000);
    return $"Hello, {name}!";
}

[ExcelFunction(IsVolatile = false, IsClusterSafe = true, Name = "HelloAsync")]
public static async Task<object> HelloAsync(
[ExcelArgument(Name = "NAME", AllowReference = true, Description = "Name for greetings")] string name)
{
    if (string.IsNullOrWhiteSpace(name)) name = "Data Spreads";
    // This will nicely return #NA and then the result after the delay, without blocking Excel UI
    await Task.Delay(1000);
    return $"Hello, {name}!";
}

Note the IsThreadSafe = true attribute parameter of the first function: when it is set to true the execution happens not on the main thread but on a dedicated calculation thread; but the UI is still blocked. With IsThreadSafe = false, the function will always run on the main thread, which could be seen by adding Thread.CurrentThread.ManagedThreadId to the return value. In the async case, the execution is never on the main thread but on a .NET thread-pool, and one should keep in mind the COM interop issues and use the XlResult pattern when working with COM/C APIs.

Contributing & roadmap

RoslynPad: I made almost no changes to the RoslynPad core functionality - its Common, Roslyn and RoslynEditor projects are practically unchanged. Therefore, all kudos and contributions related to the code editing experience should go to the original project. My changes to hosting and GUI are very small and do not deserve a separate project; instead, I will try to push them upstream.

ExcelDna: This is a very mature and well-documented project. The fact that I could use it from the script editor without any single change just proves its reliability. Please go to its repo for details.

ExcelModel: This is my small library for Excel utilities, with a goal to make a thread-safe API using .NET types over Excel’s C or COM API. Currently it contains XlResult<TResult> helper type that is described in the Excel Model section above, and some legacy code from GitHub. At first, I will add data dumping features, such as charts and tables, and utility functions scattered among my private projects (such as ReferenceToRange in the samples above). This library is referenced from all scripts when the editor is started from Excel, will be updated frequently and is open for any contributions!

My current focus is on other features of Data Spreads project and I am happy that I checked the box with Script Editor. I will continue to polish it and contribute to upstream repos, but it is already quite good for the real job of Excel scripting - thanks to the awesome open-source libraries. I do no plan to charge anything for offline features of Data Spreads such as this Script Editor, while similar solutions cost much more (and currently provide more features, e.g. http://fcell.io/ costs $1k per annum). My goal is to make the standalone application useful for number crunching and data science even without internet connection. So please stay tuned and share your feedback and features you need the most!

P.S. If you are in Moscow and need a Christmas tree, don’t think twice and just go to http://elki-shop.ru/, the shop has the best ones in the city and 10+ years track record with happy customers! ☺


* THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

The application is experimental work in progress, even though is already useful.