The Expressive C++17 Coding Challenge in D
You might have seen that I have been coding a lot in D lately. A few weeks ago the Expressive C++17 Coding Challenge took place, and now that its winning C++ solution is public, I thought this was an excellent opportunity to show why I like D so much.
The requirements
Let me first recap the requirements of this challenge:
This command line tool should accept the following arguments:
- the filename of a CSV file,
- the name of the column to overwrite in that file,
- the string that will be used as a replacement for that column,
- the filename where the output will be written.
./program <input.csv> <column-name> <replacement-string> <output.csv>
Example input
Given this simple CSV as input
the program called with:
./program input.csv city London output.csv
should write:
Sounds fairly trivial, right? Please have a short look first at the “best” C++ and Rust solutions before you look at the D solution.
The solution
Okay, so here’s one way to solve this in D. If you are scared at the moment - don’t worry. I will explain it line by line below.
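As a reference point for the walkthrough, here is a minimal sketch of such a program, pieced together from the fragments quoted in the sections below. It is not necessarily identical to my original listing: the helper name replaceColumn and the variable names inFile, columnName and os are assumptions made for this sketch.

```d
#!/usr/bin/env rdmd
import std.algorithm, std.exception, std.format, std.range, std.stdio;

// Hypothetical helper name; the same logic could live directly in main.
void replaceColumn(string inFile, string columnName, string replacement, string outFile)
{
    // Lazily read the input line by line and split every line by commas.
    auto lines = File(inFile).byLine.map!(a => a.splitter(","));
    // The header is the first line; find the index of the requested column.
    auto colIndex = lines.front.countUntil(columnName);
    enforce(colIndex >= 0, "Invalid column name. Valid are: %(%s, %)".format(lines.front));
    auto os = File(outFile, "w");
    os.writefln("%-(%s,%)", lines.front); // write the header without quotes
    foreach (line; lines.dropOne)
    {
        auto r = line.enumerate // (index, value) pairs
            .map!(a => a.index == colIndex ? replacement : a.value)
            .joiner(",");
        os.writeln(r);
    }
}

void main(string[] args)
{
    enforce(args.length == 5, "Invalid args.\n" ~
        "./tool <input.csv> <colum-name> <replacement-string> <output.csv>");
    replaceColumn(args[1], args[2], args[3], args[4]);
}
```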
So how does this compare to C++17 and Rust?
Language | LoC | Words | Characters | Time |
---|---|---|---|---|
C++ | 125 | 365 | 2724 | 15s |
Rust | 83 | 231 | 1703 | 6s |
D | 17 | 84 | 710 | 12s |
D (slightly tweaked) | 25 | 98 | 794 | 5s (4s with LTO) |
CSV parsing is a lot more complicated than splitting by a delimiter, and in the real world you shouldn’t roll your own CSV parser. However, this article aims to analyze the expressive power of D in comparison to C++ and Rust by focusing on a common example. Later in this article I will also present a solution with only 12 lines that uses D’s built-in std.csv module.
I used the following D code to generate a simple CSV file with 10 fields and 10 million lines:
rdmd --eval='10.iota.map!(a => "field".text(a)).join(",")
.repeat(10_000_000).joiner(newline).writeln' > input_big.csv
The resulting input_big.csv has a size of 668M.
LTO stands for link-time optimization and I included it to show that we can always tweak the performance with a few easy tricks. See the benchmarking section for more details.
Aren’t you concerned that Rust is faster in this benchmark?
Not at all. The challenge was to write expressive code. When performance really matters D provides the same tools as C or C++ and D even supports native interoperability with C and most of C++.
In this example, however, I/O is the bottleneck, and D provides a few convenience features by default, such as locked file handles (so that accessing files is thread-safe) and Unicode-aware input. It’s easy to opt out of such productivity features and use other tricks like memory-mapped files. For interested readers I have attached a slightly optimized version at the end.
In addition, if you are interested in performance, Jon Degenhardt (a member of eBay’s data science team) has made an excellent performance benchmark comparing eBay’s tsv-utils with existing CSV/TSV processing tools written in C, Go, and Rust.
1) What is #!/usr/bin/env rdmd?
One of my favorite aspects of D is that it has a blazingly fast compiler. Period. I can compile the entire D front-end of the compiler (~200 kLoC) in less than two seconds, or the entire standard library (> 300 kLoC, with lots and lots of compile-time function evaluation and templates) in 5 seconds from scratch, without any cache or incremental compilation.
This means that the compiler is almost as fast as an interpreter, and rdmd is the tool that allows handy usage of D as a “pseudo-interpreted” language. You can invoke rdmd with any file and it will automatically figure out all required files based on your dependencies and pass them to the compiler.
It’s very popular in the D community because for small scripts one doesn’t even notice that the program is compiled to real machine code under the hood. Also, if the shebang header is added and the file is executable, D programs can be run as if they were scripts:
./main.d input.csv city London output.csv
2) So you import a bunch of libraries. What do they do?
import std.algorithm, std.exception, std.format, std.range, std.stdio;
In short, std.stdio is for input and output, std.range is about D’s magic streams called “ranges”, and std.algorithm abstracts on top of them and provides generic interfaces for a lot of sophisticated algorithms.
Moreover, std.exception offers methods for working with exceptions like enforce, and finally std.format bundles methods for string formatting.
Don’t worry - the functionality imported from these modules will be explained soon.
3) Your program has a main function. What’s so special about it compared to C or C++?
For starters, arrays in D have a length. Try:
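A minimal sketch of such an experiment, assuming a hypothetical helper foo that reads past the end of the argument array (the names are chosen for illustration only):

```d
import std.stdio;

// Hypothetical helper: prints the number of arguments, then reads element 5.
void foo(string[] args)
{
    writeln(args.length); // every array knows its own length
    writeln(args[5]);     // out of bounds unless at least six elements exist
}

void main(string[] args)
{
    foo(args);
}
```

Run without extra arguments, the second access is out of bounds and is caught by the bounds check instead of reading arbitrary memory.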
Compared to C/C++ null-terminated strings and arrays, it won’t segfault. It would just throw a nice Error:
core.exception.RangeError@./main.d(10): Range violation
----------------
??:? _d_arrayboundsp [0x43b622]
prog.d:9 void main.foo(immutable(char)[][]) [0x43ac93]
prog.d:4 _Dmain [0x43ac67]
Oh so D performs automatic bounds-checking before accessing the memory. Isn’t that expensive?
It’s almost negligible compared to the safety it buys, but D is a language for everyone, so the people who want to squeeze out the last cycles of their processor can do so by simply compiling with -boundscheck=off (for obvious reasons this isn’t recommended).
In D, strings are arrays too and there’s another nice property about D’s arrays. They are only a view on the actual memory and you don’t copy the array, but just the view of the memory (in D it’s called a slice).
Consider this example:
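The following sketch illustrates the view semantics; the helper name inner is an assumption made for this illustration:

```d
import std.stdio;

// Hypothetical helper: returns a view on everything but the first and last element.
int[] inner(int[] a)
{
    return a[1 .. $ - 1]; // a slice: pointer + length, nothing is copied
}

void main()
{
    int[] arr = [1, 2, 3, 4, 5];
    int[] view = inner(arr);
    view[0] = 42;                    // writes through to the memory of arr
    assert(arr == [1, 42, 3, 4, 5]);
    assert(view == [42, 3, 4]);
    assert(view.ptr == arr.ptr + 1); // same memory, just a different window
    writeln(arr, " ", view);
}
```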
There are many other things D has learned from C and C++. Walter has recently written a great article on how D helps to vanquish forever these bugs that blasted your kingdom, which I highly recommend if you have a C/C++ background.
4) What’s up with this enforce?
I have never seen the ~ operator before!
It’s the string concatenation (or, more generally, array concatenation) operator. How often have you encountered code like a + b and needed to know the types of a and b to know whether it’s an addition or a concatenation?
Why don’t you use an if statement and terminate the program explicitly?
That’s valid D too. D allows a lot of different programming styles, but this article is intended to highlight a few specific D styles like enforce.
enforce is a function defined in std.exception and throws an exception if its first argument has a falsy value.
Hmm, I looked at the documentation and saw this monster. I thought it simply throws an exception?
I don’t have the time to fully dive into D’s syntax, but auto instructs the compiler to infer the return type for you. This leads to the interesting Voldemort return types, which can’t be named by the user, but that’s a good topic for another article.
The next part, (E : Throwable = Exception, T), looks a bit complicated, but don’t worry yet. It means that E is a template parameter which needs to inherit from Throwable (the root of all exceptions) and is Exception by default. T is the template type of value.
Wait. I just instantiated a template without specifying its template parameters?
Yes, the D compiler does all the hard work for you. The technical term is Implicit Function-Template Instantiation (IFTI). Of course, we could have instructed enforce to throw a custom exception, but more on template instantiation later.
Alright. So this function takes a generic value and a msg, but a lazy string msg?
lazy is a special keyword in D and tells the compiler to defer the evaluation of an argument expression until it is actually needed.
I don’t understand. msg seems to be a string concatenation of two strings. Isn’t this done before the enforce is called?
"Invalid args.\n" ~ "./tool <input.csv> <colum-name> <replacement-string> <output.csv>"
No, lazy is lazy and the string concatenation doesn’t happen at the caller site, but can be requested explicitly by the callee.
It gets a bit clearer if we look at the second enforce as there’s runtime work involved:
format and all the expensive work of formatting the error message is never done on the default path, but only if an exception actually gets thrown. Ignore the %(%s, %) formatting string for a bit, it will be explained soon.
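To make the deferred evaluation observable, here is a small sketch. The counter formatCalls and the helpers invalidColumnMessage and checkColumn are hypothetical names introduced only for this illustration; enforce and collectException are the real functions from std.exception.

```d
import std.exception : collectException, enforce;
import std.format : format;

int formatCalls; // counts how often the message is actually built

// Hypothetical helper that makes the (expensive) formatting observable.
string invalidColumnMessage(string[] valid)
{
    ++formatCalls;
    return "Invalid column name. Valid are: %(%s, %)".format(valid);
}

// Hypothetical wrapper around a check like the second enforce of the program.
bool checkColumn(long colIndex, string[] valid)
{
    // The second argument is lazy: it is only evaluated if the check fails.
    return enforce(colIndex >= 0, invalidColumnMessage(valid));
}

void main()
{
    auto header = ["name", "surname", "city", "country"];

    checkColumn(2, header);
    assert(formatCalls == 0); // check passed: the message was never formatted

    auto e = collectException(checkColumn(-1, header));
    assert(e !is null);
    assert(formatCalls == 1); // formatted only because the exception was thrown
}
```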
Ok, but how does that work?
In short: the compiler does a few smart lowerings for you and creates an anonymous lambda. It’s more complicated in practice, and interested readers can learn more in Walter’s advanced article D’s lazy.
For now I will use a simple trick to show what’s going on under the hood. The AST explorer at run.dlang.io allows us to peek at the internal representation of a D source file in the compiler after all semantic processing has been done. This means we can see that for the first enforce the concatenation is even done at compile-time:
As mentioned, this is a representation of the internal state of the compiler. Hence, nice aliases like string (an alias for immutable(char)[]) are resolved and numeric types are serialized with their inferred type. Similarly, \x0a is the hexadecimal representation of the newline character \n, and delegate const(char)[]() => is a lambda function without arguments that returns a const(char)[], i.e. a character array like a string. Of course, D has a shorthand syntax for lambda functions: () => "hello", but the compiler internally expands this syntactic sugar.
But there’s more magic here. What’s __FILE__ and __LINE__?
Remember that D is a compiled language and accessing the stack isn’t as easy as asking the interpreter nicely. These two default arguments are automatically set by the compiler with the file and line number of the caller. This is important for logging or throwing exceptions like we have done here.
So, an API author can simply say “Hey, I would like to know the line number of my caller.” and doesn’t depend on the user hacking the replacements like it’s done in C/C++ with preprocessor macros:
In fact, D doesn’t even have a preprocessor.
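On the D side, such an API can be sketched with default arguments. The helper name where is hypothetical and only serves this illustration:

```d
import std.format : format;
import std.stdio;

// Hypothetical helper: the defaults are filled in with the caller's file and line.
string where(string file = __FILE__, size_t line = __LINE__)
{
    return format("%s:%s", file, line);
}

void main()
{
    writeln(where());            // file and line of this very call
    writeln(where("main.d", 9)); // the defaults can still be overridden
}
```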
5) auto and a statically typed language
Hmm, but what’s auto? I thought D has a static type system?
Yes, D is statically typed, but the compiler is pretty smart, so we can let it do all the hard work for us. auto is a filler word for the compiler that means “whatever the type of the assignment, use this as the type of this variable”.
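A tiny sketch shows that the inferred types are fixed at compile time; the function twice is a made-up example:

```d
// Hypothetical function whose return type is inferred as int.
auto twice(int x)
{
    return x * 2;
}

void main()
{
    auto count = 42;    // int
    auto name = "city"; // string
    auto ratio = 0.5;   // double

    // The types are known statically, so they can be checked at compile time.
    static assert(is(typeof(count) == int));
    static assert(is(typeof(name) == string));
    static assert(is(typeof(ratio) == double));
    static assert(is(typeof(twice(count)) == int));
}
```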
6) What the heck is UFCS?
One of the major features of D is the Unified Function Call Syntax (UFCS). In short, the compiler will look up a function in the current namespace if it’s not found as a member function of a type, but let’s go through this step by step.
I looked at the documentation of File and it has a method byLine. So where’s the magic?
Have another look at map, it’s located in std.algorithm.
Okay, wait. How does this work?
The compiler internally rewrites the expression File.byLine.map to the following:
Omitting the parentheses is allowed too - after all the compiler knows that the symbol is a function.
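The rewrite can be tried out on a small example. The function square is hypothetical; map, equal and iota are the regular Phobos functions:

```d
import std.algorithm : equal, map;
import std.range : iota;

// Hypothetical free function, not a member of int.
int square(int x)
{
    return x * x;
}

void main()
{
    // UFCS: a.f(b) is tried as f(a, b) when f is not a member of a.
    assert(3.square() == square(3));
    assert(3.square == 9); // parentheses are optional

    // Both spellings build the same lazy range.
    auto chained = iota(1, 4).map!(x => x * 2);
    auto nested = map!(x => x * 2)(iota(1, 4));
    assert(chained.equal(nested));
    assert(chained.equal([2, 4, 6]));
}
```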
Okay, but what’s up with this !(a => a.splitter(","))?
! is similar to C++/Java’s <> and instantiates a template. In this case the template argument is the lambda function a => a.splitter(","). Notice that for splitter UFCS is used again, and your brain might be more used to reading splitter(a, ",") for now.
7) Ranges
Okay, to recap: we have taken the input of a file line by line and split every line by commas (,).
Wouldn’t this result in a lot of unnecessary allocation?
The short answer is: D uses “iterators on steroids” which are lazy and work is only done when explicitly requested. Usually range algorithms don’t even require any heap allocation as everything is done on the stack.
For example, in the next line .front returns the first line, through which countUntil explicitly iterates:
So lines.front looks something like:
countUntil will return the index of the first match or -1 otherwise. It’s a bit similar to the indexOf function known from e.g. JavaScript, but it accepts a template. So we could have supplied a custom predicate function:
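A short sketch of both forms, run on the header of the example input. The helper name columnIndex is an assumption for illustration:

```d
import std.algorithm : countUntil, splitter;

// Hypothetical helper: index of a column name within a header line, or -1.
ptrdiff_t columnIndex(string headerLine, string column)
{
    return headerLine.splitter(",").countUntil(column);
}

void main()
{
    enum header = "name,surname,city,country";
    assert(columnIndex(header, "city") == 2);
    assert(columnIndex(header, "zip") == -1);

    // A custom predicate is passed as a template argument:
    // the first field longer than four characters is "surname".
    assert(header.splitter(",").countUntil!(a => a.length > 4) == 1);
}
```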
8) std.format and compile-time checking of parameters
The next lines are:
I have never seen writefln("%(%s, %)"). What happens here?
writefln is just a handy wrapper around D’s format function. format itself provides a lot of options for serialization, but it’s very similar to printf, although it does provide a few goodies like the special syntax for arrays %(%s, %).
This syntax opens an array formatting “scope” with %( and closes it with %). Within this array “scope” the elements are formatted with %s (their string serialization) and ", " is used as the delimiter between the elements.
"%(%s, %)" will quote the elements by default, which is useful in most cases, but - can be used to avoid quoting. As the Expressive C++17 Coding Challenge expects output without quotes, "%-(%s,%)" is used: it avoids quoting and joins the elements with a bare comma instead of ", ". We can use rdmd to test this:
> head -n1 input.csv | rdmd --loop='writefln("%-(%s|%)", line.splitter(","))'
name|surname|city|country
--loop is a simple wrapper around foreach (line; stdin.byLine) { … } and makes it even easier to use D in command-line pipes.
%( … %) is a shorthand syntax that often comes in handy, but if you don’t like it there are many other ways to achieve the same result. For example, joiner:
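The variants can be compared side by side in a short sketch; the helper name unquoted is hypothetical:

```d
import std.algorithm : equal, joiner, splitter;
import std.format : format;

// Hypothetical helper: re-joins the fields of a line without quoting.
string unquoted(string line)
{
    return format("%-(%s,%)", line.splitter(","));
}

void main()
{
    auto fields = "name,surname,city".splitter(",");

    // Default: elements are quoted.
    assert(format("%(%s, %)", fields) == `"name", "surname", "city"`);
    // With '-': no quotes, and the delimiter is whatever follows %s.
    assert(format("%-(%s|%)", fields) == "name|surname|city");
    assert(unquoted("name,surname,city") == "name,surname,city");
    // joiner builds the same sequence lazily, without a format string.
    assert(fields.joiner(",").equal("name,surname,city"));
}
```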
Let’s get back to enforce. What would such an error message look like?
object.Exception@./main.d(9): Invalid column name. Valid are: "name", "surname", "city", "country"
----------------
??:? pure @safe void std.exception.bailOut!(Exception).bailOut(immutable(char)[], ulong, const(char[])) [0x7a34b57e]
??:? pure @safe bool std.exception.enforce!(Exception, bool).enforce(bool, lazy const(char)[], immutable(char)[], ulong) [0x7a34b4f8]
??:? _Dmain [0x7a34b17f]
Okay, but isn’t printf bad and unsafe? I heard that languages like Python are moving away from C-like formatting.
A Python library can only realize that the arguments and the format string don’t fit when it’s called. In D, the compiler knows the types of the arguments, and if you pass the format string at compile time, guess what, the format can be checked at compile time. Try to compile a format string that tries to format strings as numbers:
The compiler will complain:
/dlang/dmd/linux/bin64/../../src/phobos/std/stdio.d(3876): Error: static assert "Incorrect format specifier for range: %d"
onlineapp.d(4): instantiated from here: writefln!("%d", string)
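A sketch of the compile-time variant: the format string is passed as a template argument with !. The helper name describe and the sample values are made up; the rejected call is left as a comment so that the example still compiles.

```d
import std.format : format;
import std.stdio : writefln;

// Hypothetical helper; the format string is a template argument and is
// therefore validated against the argument types during compilation.
string describe(string file, int columns)
{
    return format!"%s has %d columns"(file, columns);
}

void main()
{
    writefln!"%s"(describe("input.csv", 4));
    // Uncommenting the next line should make the build fail,
    // because %d does not fit a string argument:
    // writefln!"%d"("not a number");
}
```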
Wow, that’s really cool. How does this work?
D has another unique feature: compile-time function evaluation (CTFE), which allows executing almost any function at compile time. All that happens is that writefln is instantiated at compile time with the string as template argument and then it calls the same format function that would normally be called at run-time with the known format string. The coolest part about this is that there’s no special casing in the compiler and everything is just a few lines of library code.
9) Let’s parse the file
Now that we have found the index of the replacement column, have opened the output csv file and have already written the header to it, all that’s left is to go over the input CSV file line by line and replace the specific CSV column with the replacement:
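The per-line step can be sketched in isolation. The helper name replaceField is an assumption, and std.array.join is used here instead of a lazy joiner only so that the helper returns a plain string that is easy to check:

```d
import std.algorithm : map, splitter;
import std.array : join;
import std.range : enumerate;

// Hypothetical helper: replaces the field at colIndex in one CSV line.
string replaceField(string line, size_t colIndex, string replacement)
{
    return line.splitter(",")
        .enumerate // iterate with (index, value) pairs
        .map!(a => a.index == colIndex ? replacement : a.value)
        .join(","); // eager here; joiner(",") would stay lazy
}

void main()
{
    assert(replaceField("Adam,Jones,Manchester,UK", 2, "London")
        == "Adam,Jones,London,UK");
}
```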
One of the cool parts of D ranges is that they are so flexible. You want to do everything in a functional way? D has you covered:
There’s another cool thing about D - std.parallelism. Have you ever been annoyed that a loop takes too long, but didn’t know a quick way to parallelize your code? Again, D has you covered with .parallel:
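A minimal sketch of a parallel foreach, on a made-up workload rather than the CSV program: every iteration writes only to its own slot, so no synchronization is needed. The helper name squares is hypothetical.

```d
import std.algorithm : equal;
import std.parallelism : parallel;

// Hypothetical helper: fills slot i with i * i; the iterations are
// distributed over the worker threads of the default task pool.
size_t[] squares(size_t n)
{
    auto result = new size_t[](n);
    foreach (i, ref slot; result.parallel)
        slot = i * i;
    return result;
}

void main()
{
    assert(squares(5).equal([0, 1, 4, 9, 16]));
}
```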
No way. I don’t believe this can be so simple.
Just try it yourself.
The Garbage Collector (GC)
On the internet and especially on reddit and HackerNews there’s a lot of criticism of D’s decision to use a GC. Go, Java, Ruby, JavaScript, etc. all use a GC, but I can’t phrase it better than Adam D. Ruppe:
D is a pragmatic language aimed toward writing fast code, fast. Garbage collection has proved to be a smashing success in the industry, providing productivity and memory-safety to programmers of all skill levels. D’s GC implementation follows in the footsteps of industry giants without compromising expert’s ability to tweak even further.
So, ask your question:
Okay, “ability to tweak even further” sounds a bit vague, what does this mean? I can tweak the memory usage?
Well, of course you can do that, but that’s something most languages with a GC allow you to do. D allows you to get the benefit of both worlds, profit from the convenience of the GC and use manual allocation methods for the hot paths in your program. This is great, because you can use the same language for prototyping and shipping your application.
A short and simplified summary of allocation patterns in D:
- RAII is supported (e.g. File you saw earlier is reference-counted and automatically deallocates its buffer and closes the file once all references are dead)
- std.typecons provides a lot of library goodies like Unique, Scoped, RefCounted for @nogc allocation
- there’s std.experimental.allocator for everyone with custom allocation needs
- malloc and friends are available in D too (everything from C is) - though if you want to use the C heap allocator I recommend its high-level wrapper
Mike Parker has recently started an extensive GC Series on the D Blog which I recommend to everyone who prefers performance over convenience.
Other goodies
std.csv
Hey, I saw that there’s std.csv in D, why didn’t you use it?
Apart from the motivation to be comparable to C++ and Rust which don’t have a built-in CSV library, it felt like cheating:
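For illustration, here is a rough sketch of how the column replacement could look with std.csv. It is not the 12-line solution mentioned earlier; the helper name replaceColumnCsv and its in-memory string interface are assumptions made to keep the example small.

```d
import std.algorithm : map;
import std.array : join;
import std.csv : csvReader;

// Hypothetical helper working on in-memory text instead of files.
string replaceColumnCsv(string text, string column, string replacement)
{
    // Passing null as the header tells csvReader to read it from the first row;
    // every record is then an associative array indexed by column name.
    auto records = csvReader!(string[string])(text, null);
    auto header = records.header;
    string result = header.join(",") ~ "\n";
    foreach (record; records)
    {
        record[column] = replacement;
        result ~= header.map!(h => record[h]).join(",") ~ "\n";
    }
    return result;
}

void main()
{
    auto output = replaceColumnCsv("name,city\nAdam,Paris", "city", "London");
    assert(output == "name,city\nAdam,London\n");
}
```

Unlike the hand-rolled version, this sketch does not report an unknown column name; it simply leaves the rows unchanged.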
std.getopt
One of the reasons why this challenge used positional arguments and no flags is that argument parsing is pretty hard in C++. It’s not in D. std.getopt provides convenience for everything out of the box:
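A sketch of what a flag-based interface might look like. The option names, the Options struct and parseArgs are all made up for this illustration; getopt, config.required and defaultGetoptPrinter are the real std.getopt building blocks.

```d
import std.getopt;

// Hypothetical container for the four inputs of the tool.
struct Options
{
    string input, column, replacement, output;
}

// Hypothetical helper: parses flags such as --input=in.csv.
Options parseArgs(string[] args)
{
    Options o;
    auto result = getopt(args,
        config.required, "input|i", "CSV file to read", &o.input,
        config.required, "column|c", "name of the column to overwrite", &o.column,
        config.required, "replacement|r", "replacement string", &o.replacement,
        config.required, "output|o", "file to write", &o.output);
    if (result.helpWanted)
        defaultGetoptPrinter("Replaces a column in a CSV file.", result.options);
    return o;
}

void main(string[] args)
{
    auto o = parseArgs(args);
}
```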
DMD, LDC and GDC
One of the things that newcomers are often confused by is that D has three compilers. The short summary is:
- DMD (DigitalMars D compiler) - latest greatest features + fast compilation (= ideal for development)
- LDC (uses the LLVM backend) - battle-tested LLVM backend + sophisticated optimizers + cross-compilation (=ideal for production)
- GDC (uses the GCC backend) - similar points as LDC
Benchmark and performance
Benchmarking a language compiler is a bit tricky as very often you end up benchmarking library functions. In general, D code can be as fast as C++ and often is even faster - after all, the LDC and GDC compilers share their backends with clang++ and g++, with all their optimization logic. If you are interested in how D programs perform against similar programs written in other languages, check out Kostya’s benchmarks.
There’s also an excellent performance benchmark from Jon Degenhardt (member of eBay’s data science team) on how eBay’s tsv-utils compare against existing CSV/TSV processing tools written in C, Go, and Rust.
Apart from the typical -O3 and -release flags, the performance-savvy can use -boundscheck=off. Additionally, LDC makes it easy to do link-time optimization (LTO) and profile-guided optimization (PGO). According to Jon’s benchmarks, LTO brings on average an additional performance gain of 10 %, just by adding the -flto=full flag. If you want to learn more about LTO and PGO in D, check out his superb tutorial or the in-depth technical article about LTO by Johan Engelen (one of the LDC developers).
@safe
Even though D is a system programming language that allows you to mess with pointers, raw memory and even inline assembly, it provides a sane way to deal with the dirty details. D has a @safe subset of the language in which the compiler will enforce that you don’t do anything stupid and shoot yourself in the foot by e.g. accessing undefined memory.
Unittest
One strategic advantage of D is that unit-testing is so easy as it’s built-in in the language and compiler. This is a valid D program:
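A minimal sketch of such a program, with a made-up function add and its test living side by side in the same file:

```d
// Hypothetical function under test.
int add(int a, int b)
{
    return a + b;
}

// Compiled in and run before main only when -unittest is passed.
unittest
{
    assert(add(1, 2) == 3);
    assert(add(-1, 1) == 0);
}
```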
And with -unittest the compiler can be instructed to emit unittest blocks to the object files or binary. Here, rdmd is again a friendly tool and you can directly go ahead and test your code like this:
rdmd -main -unittest test.d
No advanced tooling setup required. Of course, this also means that it’s particularly easy to automatically verify all examples that are listed in the documentation, because they’re part of the test suite. I even went one step further and made it possible to directly edit and run the examples on dlang.org.
Other cool D features
There are many other cool features that D offers that didn’t make it into this article, but as a teaser for future articles:
- Code generation within the language (cut down your boilerplate)
- Strong and easy Compile-Time introspection (Meta-programming)
- alias this for subtyping
- -betterC (using D without a runtime)
- mixin for easily generating code
- A module system that doesn’t suck
- debug attribute to break out of pure code
- Built-in documentation
- Contracts and invariants
- scope(exit) and scope(failure) for structuring creation with its destruction
- Native interfacing with C (and most of C++)
- with for loading symbols into the current namespace
For a full list, see the Overview of D and don’t forget that the full language specification is readable in one evening.
Downsides
Okay, so you say D is so great, but why hasn’t it taken off?
There’s a lot more to a programming language than just the language and compiler. D has to fight with the problems all young languages have to deal with, e.g. a small ecosystem, few tutorials / sparse documentation and occasional rough edges. Languages like Kotlin, Rust or Go have it a lot easier, because they have a big corporate sponsor which gives these languages a big boost.
Without such a boost, it’s a chicken/egg problem: if nobody is learning D, it also means that no one can write tutorials or better documentation. Also many people have learnt a few languages and use them in production. There’s little incentive for them to redesign their entire stack.
However, things improved greatly over the last years and nowadays even companies like Netflix, eBay, or Remedy Games use D. A few examples:
- the fastest parallel file system for High Performance Computing is written in D
- if you drive by train in Europe, chances are good that you were guided by D (Funkwerk - the company that manages the transport passenger information system - develops their software in D)
- if you don’t use an Adblocker, chances are good that algorithms written in D bid in real-time for showing you advertisement (two of the leading companies in digital advertising (Sociomantic and Adroll) use D)
The organizations using D page lists more of these success stories.
Of course, D - like every other language - has its “ugly” parts, but there’s always work in progress to fix these and compared to all other languages I have worked with, the ugly parts are relatively tiny.
Where to go from here?
Okay that sounds great, but how do I install D on my system?
Use the install script:
curl https://dlang.org/install.sh | bash -s
And start hacking!
Acknowledgements
Thanks a lot to Timothee Cour, Juan Miguel Cejuela, Jon Degenhardt, Lio Lunesu, Mike Franklin, Steven Schveighoffer, Simen Kjærås, Walter Bright, Arredondo, Martin Tschierschke, Nicholas Wilson, Arun Chandrasekaran, Per Nordlöw, John Gabriele, jmh530, Dukc, tornchi, and ketmar for their helpful feedback.
A huge thanks also goes to Jonathan Boccara and Bartłomiej Filipek for organizing the Expressive C++17 Coding Challenge and opening the discussion about expressiveness of modern systems programming languages.
Attachments
It’s possible to do three easy tweaks to make I/O faster in D:
- disabling auto-decoding with byCodeUnit
- non-thread-safe I/O with lockingTextWriter
- use of std.mmfile