Find in Files with F#

Introduction

There are many find in files tools out there; the one I mainly use is Visual Studio’s. They’re great for finding bits of text using wildcards or even regular expressions (if you can stomach them), but sometimes I find myself wanting to do something a little different from what those tools allow. One example is where I want to find log file lines which were written between certain dates or times. Those kinds of searches generally require parsing part of each line in order to derive some meaning. Generic find in files tools rarely offer this kind of functionality.

As usual, F# Interactive is my tool of choice for implementing my ad-hoc requirements. Note that I’m going to do my searches line by line to keep things simple. However, once you get to grips with how the code works, modifying it to cater for more specific requirements shouldn’t be a problem.

Core Definitions

We’ll start our F# script by opening the required namespaces:

    open System
    open System.IO

We’ll now define a type which represents a result. We’ll include a line number, the file the line was found in, and the contents of the line itself. The contents can be any type (in case we want to convert the line to something else). We’ll include a sensible ToString override for easy printing.

    type 'a SearchResult(lineNo:int,file:string,content:'a) =
        member this.LineNo = lineNo
        member this.File = file
        member this.Content = content
        override this.ToString() =
            sprintf "line %d in %s – %O" this.LineNo this.File this.Content

Here is perhaps one of the most re-used functions I’ve ever written. It turns a file name into a lazily evaluated sequence of lines. This is key to allowing low memory consumption when searching large files, assuming that the lines aren’t millions of characters long.

    let private read (file:string) =
        seq { use reader = new StreamReader(file)
              while not reader.EndOfStream do yield reader.ReadLine() }
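
For comparison, here is a minimal sketch of the same idea using File.ReadLines, which .NET 4 and later provide and which also enumerates lines lazily. The readLines name and the temporary sample file are assumptions made for illustration; they aren't part of the script in this post.

```fsharp
open System.IO

// Hypothetical alternative to 'read': File.ReadLines enumerates lazily,
// so lines are pulled from disk only as the sequence is consumed.
let readLines (file: string) : seq<string> =
    File.ReadLines(file)

// Illustrative usage against a throwaway temp file.
let sample = Path.GetTempFileName()
File.WriteAllLines(sample, [| "first"; "second"; "third" |])

// Only the first two lines are requested here.
let firstTwo = readLines sample |> Seq.truncate 2 |> Seq.toList

File.Delete(sample)
```

Either version can feed the rest of the pipeline, since both produce a plain sequence of strings.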

Next up are the functions which will do our searching for us. search_file searches a single file using two functions and returns a sequence of SearchResults.

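    let search_file parse check file =
        read file
        // Pair each parsed line with its 1-based line number.
        |> Seq.mapi (fun i l -> i + 1, parse l)
        // Keep only the (line number, parsed line) pairs that pass the check.
        |> Seq.filter check
        |> Seq.map (fun (i, l) -> SearchResult(i, file, l))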

One of the functions is named parse and can be used to convert a line to an instance of another type. It has the following type:

    string -> 'a

For simple string searches we can simply pass the built-in id function in. If we want to do something more complicated, like grab the line and a date parsed from it, we can always supply a function which returns some other type.

The other function parameter is named check. It has the following type, where the int is the line number and the generic type 'a is the parsed contents of the line. It returns a bool indicating whether the line should be included in the results.

    (int * 'a) -> bool
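
To make those two signatures concrete, here is a small sketch of a parse/check pair for comma-separated lines. The names parseCsv and isErrorRow and the sample rows are made up for illustration; the pair is applied by hand to numbered lines in the same way search_file applies parse and check.

```fsharp
// Hypothetical parse function: string -> string[]
// Splits a comma-separated line into trimmed fields.
let parseCsv (line: string) : string[] =
    line.Split(',') |> Array.map (fun field -> field.Trim())

// Hypothetical check function: (int * string[]) -> bool
// Keeps rows after the header (line 1) whose third field is "ERROR".
let isErrorRow (lineNo: int, fields: string[]) : bool =
    lineNo > 1 && fields.Length >= 3 && fields.[2] = "ERROR"

// Number the lines from 1, parse each one, then filter with the check.
let rows =
    [ "time, source, level"
      "00:01, web, INFO"
      "00:02, db, ERROR" ]
    |> List.mapi (fun i l -> i + 1, parseCsv l)
    |> List.filter isErrorRow
```

Only the third line survives the filter here, and it arrives as an array of fields rather than as raw text.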

The next function, named search, simply uses search_file to search a collection of files. Each file is mapped to a sequence of search results. The resulting sequence of sequences is then flattened into a simple one-dimensional sequence of results via Seq.concat. The combination of Seq.map and Seq.concat performs a similar function to LINQ’s SelectMany extension method.

    let search parse check =
        Seq.map (fun file -> search_file parse check file)
        >> Seq.concat
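
As an aside, F#’s Seq.collect performs the map and the flatten in a single call. The sketch below compares the two approaches on in-memory data; fakeResults is a hypothetical stand-in for search_file, and the file names are made up.

```fsharp
// Stand-in for search_file: each "file" yields a sequence of results.
// The file names and contents here are made up for illustration.
let fakeResults (file: string) : seq<string> =
    seq { for n in 1 .. 2 -> sprintf "%s:%d" file n }

let files = [ "a.log"; "b.log" ]

// Map then flatten, as in 'search'.
let viaMapConcat = files |> Seq.map fakeResults |> Seq.concat |> Seq.toList

// Seq.collect does both steps in one call.
let viaCollect = files |> Seq.collect fakeResults |> Seq.toList
```

Both lists contain the same four items in the same order, so which form to use is largely a matter of taste.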

Calling search by passing in a parse and a check function with a sequence of files is now enough to perform our search. However, I’d like a default way of iterating through the sequence printing each result, ending in a print of the number of results. This kind of functionality can be catered for using Seq.fold, which visits each item in a sequence whilst passing an accumulator argument through the whole lot. We’ll use the accumulator to count our results, and print each result during the fold using the SearchResult type’s ToString override.

    let print_results<'a> : 'a SearchResult seq -> unit =
        Seq.fold (fun c r -> printfn "%O" r; c + 1) 0
        >> printfn "%d results"

Note that we need the explicit type parameter and annotation here to make the function generic. Because print_results is defined by composing functions rather than by taking an explicit argument, F# would otherwise fix its type based on how it is first used, which isn’t what we want.
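
One way around the annotation, sketched here under a hypothetical name, is to take the sequence as an explicit parameter. F# generalizes an ordinary function like that automatically. Unlike print_results, this variant also returns the count, purely so the example has something to inspect.

```fsharp
// Hypothetical variant named printAll: taking the sequence as an explicit
// parameter makes this an ordinary function, which F# generalizes without
// needing an explicit type parameter.
let printAll (results: seq<'a>) : int =
    let count = results |> Seq.fold (fun c r -> printfn "%O" r; c + 1) 0
    printfn "%d results" count
    count

// The same function works with different element types.
let intCount = printAll [ 1; 2; 3 ]
let stringCount = printAll [ "a"; "b" ]
```

The trade-off is that the point-free style is lost, which is a matter of preference rather than correctness.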

A Simple Search

Usage is as follows (with some sample results). This example searches all .cs files under c:\_src for lines which look like they might be test methods, and prints the results to the console.

    Directory.GetFiles(@"c:\_src\", "*.cs", SearchOption.AllDirectories)
    |> search id (fun (i, l) -> l.Contains("public void Test"))
    |> print_results

    line 16 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\AssertsToStringConverterTests.cs –         public void TestConvert(Object value, Type targetType, Object parameter, String cultureString, Object expectedValue)
    line 25 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\AssertsToStringConverterTests.cs –         public void TestConvertBack()
    line 19 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\NameToStringConverterTests.cs –         public void TestConvert(Object value, Type targetType, Object parameter, String cultureString, Object expectedValue)
    line 28 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\NameToStringConverterTests.cs –         public void TestConvertBack()
    line 18 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\StatusToBrushConverterTests.cs –         public void TestConvert(Status status)
    line 28 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\StatusToBrushConverterTests.cs –         public void TestConvertBack()
    line 17 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\StatusToStringConverterTests.cs –         public void TestConvert(Status status)
    line 27 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\StatusToStringConverterTests.cs –         public void TestConvertBack()
    line 17 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\SuiteTypeToIsExpandedConverterTests.cs –         public void TestConvert(SuiteType suiteType)
    line 27 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\SuiteTypeToIsExpandedConverterTests.cs –         public void TestConvertBack()
    line 17 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\SuiteTypeToStringConverterTests.cs –         public void TestConvert(SuiteType suiteType)
    line 27 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\SuiteTypeToStringConverterTests.cs –         public void TestConvertBack()
    line 16 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\TimeToStringConverterTests.cs –         public void TestConvert(Object value, Type targetType, Object parameter, String cultureString, Object expectedValue)
    line 25 in c:\_src\nunitresultsexplorer\NUnitResultsExplorerTests\Unit\TimeToStringConverterTests.cs –         public void TestConvertBack()
    line 23 in c:\_src\nunitresultsexplorer\NUnitResultsTests\Unit\ResultsFileTests.cs –         public void TestSample1(String resourceFile, Status status, Int32 testCaseCount, IList<ITestResult> expectedResults)
    line 24 in c:\_src\nunitresultsexplorer\NUnitResultsTests\Unit\TestCaseTests.cs –         public void TestStatus(Status expectedStatus)
    line 39 in c:\_src\nunitresultsexplorer\NUnitResultsTests\Unit\TestCaseTests.cs –         public void TestName(String name)
    line 54 in c:\_src\nunitresultsexplorer\NUnitResultsTests\Unit\TestCaseTests.cs –         public void TestNamespace(String nspace)
    line 66 in c:\_src\nunitresultsexplorer\NUnitResultsTests\Unit\TestCaseTests.cs –         public void TestTime(Nullable<TimeSpan> time)
    line 96 in c:\_src\nunitresultsexplorer\NUnitResultsTests\Unit\TestCaseTests.cs –         public void TestAsserts(Nullable<Int32> asserts)
    line 173 in c:\_src\nunitresultsexplorer\NUnitResultsTests\Unit\TestCaseTests.cs –         public void TestCount()
    line 24 in c:\_src\nunitresultsexplorer\NUnitResultsTests\Unit\TestSuiteTests.cs –         public void TestConstructor(String name, String nspace, Status status, SuiteType suiteType, Nullable<TimeSpan> time, Nullable<Int32> asserts, IList<ITestResult> innerCases)
    line 72 in c:\_src\nunitresultsexplorer\NUnitResultsTests\Unit\TestSuiteTests.cs –         public void TestCount(String resourceFile, Int32 testCaseCount, Status status, IList<ITestResult> results)
    23 results

A Log File Search

Fair enough, but that can be achieved easily using the find in files tool in Visual Studio, right? Let’s try the log file search I mentioned in my opening paragraph. I’ve generated some sample log files, where the first 23 characters of each line hold a timestamp which is parsable by DateTime’s static Parse method.

To make the code for this search a little easier to understand, I’ll define a type for our log lines. We could use a tuple of type DateTime*string, but that could end up being pretty difficult to read.

We’ll include a static Parse method on that type which is able to convert each line to a LogLine:

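    type LogLine (d:DateTime, l:string) =
        member this.Date = d
        member this.Line = l
        override this.ToString() = this.Line
        // The timestamp "yyyy-MM-dd HH:mm:ss.fff" occupies the first 23 characters;
        // Remove(23) drops everything after it before parsing.
        static member Parse(line:string) =
            LogLine(line.Remove(23) |> DateTime.Parse, line)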
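
Parse throws if a line is too short or doesn’t begin with a timestamp. As a rough sketch of a more forgiving approach, the hypothetical helper below (not part of the script in this post) uses DateTime.TryParseExact with the invariant culture and returns an option instead. The format string assumes the timestamp layout used by the sample logs.

```fsharp
open System
open System.Globalization

// Hypothetical helper: returns Some timestamp when the line starts with a
// "yyyy-MM-dd HH:mm:ss.fff" stamp, and None otherwise.
let tryParseStamp (line: string) : DateTime option =
    if line.Length < 23 then None
    else
        match DateTime.TryParseExact(line.Substring(0, 23),
                                     "yyyy-MM-dd HH:mm:ss.fff",
                                     CultureInfo.InvariantCulture,
                                     DateTimeStyles.None) with
        | true, stamp -> Some stamp
        | false, _ -> None

// A well-formed log line and a line that isn't one (made-up stack frame).
let good = tryParseStamp "2010-09-13 00:02:36.284 Something timed out"
let bad = tryParseStamp "   at Some.Stack.Frame()"
```

A parse function built on this could return an option, leaving the check function to discard lines with no timestamp.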

Next, it’s a simple case of composing our functions together. We’ll search for all lines which were logged on the 13th of September 2010 between 00:00 and 00:15 as an example.

    Directory.GetFiles(@"c:\logs\", "*.log", SearchOption.AllDirectories)
    |> search LogLine.Parse (fun (i, line) -> line.Date >= DateTime(2010, 9, 13) &&
                                              line.Date < DateTime(2010, 9, 13, 0, 15, 0))
    |> print_results

    line 4017 in c:\logs\log00.log – 2010-09-13 00:02:36.284 Something got cut off bef
    line 4018 in c:\logs\log00.log – 2010-09-13 00:06:55.522 Something unexpected happened
    line 3927 in c:\logs\log01.log – 2010-09-13 00:02:52.339 Something timed out
    line 3928 in c:\logs\log01.log – 2010-09-13 00:07:27.036 Something got cut off bef
    line 3929 in c:\logs\log01.log – 2010-09-13 00:14:53.763 Something unexpected happened
    line 4105 in c:\logs\log02.log – 2010-09-13 00:00:18.957 Started something
    line 4106 in c:\logs\log02.log – 2010-09-13 00:04:20.651 Started something
    line 4107 in c:\logs\log02.log – 2010-09-13 00:09:02.083 Something unexpected happened
    line 4108 in c:\logs\log02.log – 2010-09-13 00:09:44.824 Something unexpected happened
    line 4109 in c:\logs\log02.log – 2010-09-13 00:10:26.029 Something finished
    line 4110 in c:\logs\log02.log – 2010-09-13 00:11:38.986 Something unexpected happened
    line 4111 in c:\logs\log02.log – 2010-09-13 00:11:47.936 Something timed out
    line 3969 in c:\logs\log03.log – 2010-09-13 00:01:53.655 Something finished
    line 3970 in c:\logs\log03.log – 2010-09-13 00:08:58.613 Something timed out
    line 3971 in c:\logs\log03.log – 2010-09-13 00:13:17.185 Something finished
    line 3972 in c:\logs\log03.log – 2010-09-13 00:13:31.977 Something finished
    line 3973 in c:\logs\log03.log – 2010-09-13 00:13:54.558 Started something
    line 4013 in c:\logs\log04.log – 2010-09-13 00:02:34.291 Started something
    line 4014 in c:\logs\log04.log – 2010-09-13 00:03:06.385 Something unexpected happened
    line 4015 in c:\logs\log04.log – 2010-09-13 00:05:30.030 Something timed out
    line 4016 in c:\logs\log04.log – 2010-09-13 00:07:18.479 Something unexpected happened
    line 4017 in c:\logs\log04.log – 2010-09-13 00:13:37.039 Something finished
    line 4018 in c:\logs\log04.log – 2010-09-13 00:13:45.505 Something timed out
    23 results

Cool. We can now search our textual log files based on dates!

Conclusion

I’ve found the functions included in this article to be really useful when performing complex searches involving simple log or CSV files. However, searching line by line is a limitation which I have had to work around once or twice. Log files such as the one in my example may very well contain stack traces containing newline characters, where we’d ideally want to treat the entire trace as a single line.

The main principle to focus on when adapting this code for such searches is to translate a file into a sequence of searchable objects. Line by line is probably the simplest way to turn a file into a sequence of objects, and it’s easy to make such searches generic. With more complex searches, it’s a little more difficult to attain that versatility without making the code difficult to understand. Maybe I’ll try to deal with that in a future post.
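
As a rough illustration of that principle, here is one possible way to fold continuation lines (such as stack trace frames) into the entry that precedes them. It is only a sketch: groupEntries and startsWithDigit are hypothetical names, the rule that an entry starts with a digit is an assumption about the log format, and the stack frames are made up.

```fsharp
// Groups lines into entries. A new entry begins at each line for which
// isStart returns true; other lines are appended to the current entry.
// Yields (line number of the entry's first line, joined entry text).
let groupEntries (isStart: string -> bool) (lines: seq<string>) : seq<int * string> =
    seq {
        let buffer = ResizeArray<string>()
        let startNo = ref 0
        let lineNo = ref 0
        for line in lines do
            lineNo.Value <- lineNo.Value + 1
            // A start line closes the entry collected so far.
            if isStart line && buffer.Count > 0 then
                yield startNo.Value, String.concat "\n" buffer
                buffer.Clear()
            if buffer.Count = 0 then startNo.Value <- lineNo.Value
            buffer.Add(line)
        // Emit whatever is left once the input runs out.
        if buffer.Count > 0 then
            yield startNo.Value, String.concat "\n" buffer
    }

// Assumption: entries begin with a digit (the year of the timestamp).
let startsWithDigit (line: string) =
    line.Length > 0 && System.Char.IsDigit(line.[0])

let entries =
    [ "2010-09-13 00:02:36.284 Something unexpected happened"
      "   at Foo.Bar()"
      "   at Foo.Baz()"
      "2010-09-13 00:06:55.522 Something finished" ]
    |> groupEntries startsWithDigit
    |> Seq.toList
```

The result is still a lazy sequence, so in principle it could sit between the file-reading step and the parse step, with only one entry held in memory at a time.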

Anyway, for now, the entire line by line search script including sample usage is pasted below.

    open System
    open System.IO

    type 'a SearchResult(lineNo:int,file:string,content:'a) =
        member this.LineNo = lineNo
        member this.File = file
        member this.Content = content
        override this.ToString() =
            sprintf "line %d in %s – %O" this.LineNo this.File this.Content

    let read (file:string) =
        seq { use reader = new StreamReader(file)
              while not reader.EndOfStream do yield reader.ReadLine() }

    let search_file parse check file =
        read file
        |> Seq.mapi (fun i l -> i + 1, parse l)
        |> Seq.filter check
        |> Seq.map (fun (i, l) -> SearchResult(i, file, l))

    let search parse check =
        Seq.map (fun file -> search_file parse check file)
        >> Seq.concat

    let print_results<'a> : 'a SearchResult seq -> unit =
        Seq.fold (fun c r -> printfn "%O" r; c + 1) 0
        >> printfn "%d results"

    Directory.GetFiles(@"c:\_src\", "*.cs", SearchOption.AllDirectories)
    |> search id (fun (i, l) -> l.Contains("public void Test"))
    |> print_results

If you want to try out the log file search, you can generate some logs using the code below:

    open System
    open System.IO

    let r = Random()

    let logDescs =
        [| "Started something"
           "Something timed out"
           "Something finished"
           "Something unexpected happened"
           "Something got cut off bef" |]

    // Builds one random log line dated in September 2010.
    let getRandomLogLine i =
        sprintf "2010-09-%02i %02i:%02i:%02i.%03i %s"
                (r.Next(1, 31))
                (r.Next(0, 24))
                (r.Next(0, 60))
                (r.Next(0, 60))
                (r.Next(0, 1000))
                (logDescs.[r.Next(0, logDescs.Length)])

    // Writes 10000 random lines to the file, sorted so they appear in time order.
    let writeLogStuff (file:string) =
        use w = new StreamWriter(file)
        Seq.init 10000 (fun i -> getRandomLogLine i)
        |> Seq.sort
        |> Seq.iter (fun l -> w.WriteLine(l))

    // Make sure the target folder exists before writing to it.
    Directory.CreateDirectory(@"c:\logs") |> ignore

    Seq.init 5 (sprintf @"c:\logs\log%02i.log")
    |> Seq.iter writeLogStuff