0
This thread on the FreeBSD mailing discusses why GNU grep (that you get on Linux)
is faster than the grep on FreeBSD. I thought I'd write up some notes on this.
I come from the world of "network intrusion detection", where we search network traffic for patterns indicating hacker activity. In many cases, this means solving the same problem of grep with complex regexes, but doing so very fast, at 10gbps on desktop-class hardware (quad-core Core i7). We in the intrusion-detection world have seen every possible variation of the problem. Concepts like "Boyer-Moore" and "Aho-Corasick" may seem new to you, but they are old-hat to us.
Zero-copy
Your first problem is getting the raw data from the filesystem into memory. As the thread suggests, one way of doing this is "memory-mapping" the file. Another option would be "asynchronous I/O". When done right, either solution gets you "zero-copy" performance. On modern Intel CPUs, the disk controller will DMA the block directly into the CPU's L3 cache. Network cards work the same way, which is why getting 10-gbps from the network card is trivial, even on slow desktop systems.
Double-parsing
Your next problem is
stop with the line parsing, idiots. All these
Continue reading