CSV Parsers

If you’ve ever built a CSV parser, you understand how tricky it can be.

Of course, it seems simple at first. Just write one line of code like this:

String[] fields = csvLine.split(",");
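Here is a quick sketch of where that one-liner goes wrong. The class name `NaiveSplitDemo`, the helper `naiveSplit`, and the sample lines are all made up for illustration. It shows two things: a comma inside quotes splits one field in two, and Java's `split` silently drops trailing empty fields.

```java
import java.util.Arrays;

public class NaiveSplitDemo {
    // Splits on every comma, with no awareness of quotes.
    static String[] naiveSplit(String csvLine) {
        return csvLine.split(",");
    }

    public static void main(String[] args) {
        // Hypothetical sample: the second field contains a comma inside quotes.
        String quoted = "1,\"Smith, John\",42";
        // Three logical fields come back as four pieces.
        System.out.println(naiveSplit(quoted).length + " " + Arrays.toString(naiveSplit(quoted)));

        // Hypothetical sample: two empty fields at the end of the line.
        String trailing = "a,b,,";
        // split(",") discards trailing empty strings, so only two pieces survive.
        System.out.println(naiveSplit(trailing).length + " " + Arrays.toString(naiveSplit(trailing)));
    }
}
```

Passing a negative limit, as in `csvLine.split(",", -1)`, keeps the trailing empty fields, but it does nothing for the quoting problem.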

Then, you realize some fields have optional quotes. So maybe you end up with something like this:

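// Needs java.util.ArrayList, java.util.List, java.util.regex.Matcher and java.util.regex.Pattern.
List<String> fields = new ArrayList<>();
// A field is either a quoted string or a run of non-comma characters.
Pattern pattern = Pattern.compile("\\s*(\"[^\"]*\"|[^,]*)\\s*");
Matcher matcher = pattern.matcher(csvLine);
while (matcher.find()) {
    // Caveat: the pattern can also match an empty string (for example at a comma), so spurious blank entries get added too.
    fields.add(matcher.group(1));
}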

But wait, some CSV files escape a quote by doubling it, like this:
1,3.2,BCD,"qwer 47"" ""dfg""",1
So you get fancier and write a regex like this:

boolean foundMatch = csvLine.matches(
    "(?x)^         # Start of string\n" +
    "(?:           # Match the following:\n" +
    " (?:          #  Either match\n" +
    "  [^\",\\n]*+ #   0 or more characters except comma, quote or newline\n" +
    " |            #  or\n" +
    "  \"          #   an opening quote\n" +
    "  (?:         #   followed by either\n" +
    "   [^\"]*+    #    0 or more non-quote characters\n" +
    "  |           #   or\n" +
    "   \"\"       #    an escaped quote (\"\")\n" +
    "  )*          #   any number of times\n" +
    "  \"          #   followed by a closing quote\n" +
    " )            #  End of alternation\n" +
    " ,            #  Match a comma (separating the CSV columns)\n" +
    ")*            # Do this zero or more times.\n" +
    "(?:           # Then match\n" +
    " (?:          #  using the same rules as above\n" +
    "  [^\",\\n]*+ #  an unquoted CSV field\n" +
    " |            #  or a quoted CSV field\n" +
    "  \"(?:[^\"]*+|\"\")*\"\n" +
    " )            #  End of alternation\n" +
    ")             # End of non-capturing group\n" +
    "$             # End of string");

(courtesy of Stack Overflow)
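Note that `matches` only returns a boolean: the regex above tells you whether a line is well-formed, not what its fields are. To actually pull the fields out, one alternative is a small character-by-character scanner. The following is only a minimal sketch, not a production parser. The class name `SimpleCsvLine` and the method `parseLine` are invented for this example, and it assumes each record fits on one line (no newlines inside quoted fields).

```java
import java.util.ArrayList;
import java.util.List;

public class SimpleCsvLine {
    // Splits one CSV line into fields, honoring quotes and doubled ("") quotes.
    // Assumption: the whole record sits on a single line.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"'); // doubled quote becomes one literal quote
                        i++;                 // skip the second quote of the pair
                    } else {
                        inQuotes = false;    // closing quote
                    }
                } else {
                    current.append(c);       // commas inside quotes are plain text
                }
            } else if (c == '"') {
                inQuotes = true;             // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);        // start the next field
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());      // the last field has no trailing comma
        return fields;
    }

    public static void main(String[] args) {
        // The sample line from this post, written as a Java string literal.
        String line = "1,3.2,BCD,\"qwer 47\"\" \"\"dfg\"\"\",1";
        for (String field : parseLine(line)) {
            System.out.println("[" + field + "]");
        }
    }
}
```

For the sample line this yields five fields, with the fourth one unescaped to `qwer 47" "dfg"`. It still ignores whitespace trimming, other delimiters, and malformed quoting, which is a good part of why the libraries below exist.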

You think to yourself, you can’t be the only one who has had to deal with this… and you’re right!

There are several packages that parse CSV files.

  • Apache Commons CSV: This carries the Apache brand name and seems pretty powerful, with the ability to parse multiple CSV formats, including MS Excel CSV, MySQL CSV, RFC 4180 CSV, and tab-delimited files. The first production version (1.0) was just released, but it had been in the development sandbox for several years.
  • OpenCSV: This has been around for a while and seems quite popular. It’s very simple to use, but there have been complaints that support is lacking and that the authors often don’t respond. After a 3-year hiatus, the team finally released a new version (3.0), possibly in response to the recent official release of Apache Commons CSV.
  • JSefa: This is an annotation-based CSV parser. Consider it if you’re parsing well-structured CSVs. You simply annotate a POJO class, and it parses the CSV file and populates Java objects for you. How convenient!

All of the above are licensed under the Apache License V2.0, meaning they’re safe for academic, commercial, or recreational use.

I ultimately went with Apache Commons CSV because I’m a brand name whore and I’ve already accessorized so many of my projects with Apache libraries, which hopefully means less clashing of fashion styles and classes.

If you decide to go with Commons CSV and use Maven to build your projects, simply add this dependency to your pom.xml:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-csv</artifactId>
    <version>1.0</version>
</dependency>

Then parsing a CSV is as simple as this:

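// Needs java.io.FileReader, java.io.Reader, org.apache.commons.csv.CSVFormat and org.apache.commons.csv.CSVRecord.
Reader in = new FileReader("path/to/file.csv");
// withHeader() with no arguments takes the column names from the first record;
// without it, looking up a value by column name throws an IllegalStateException.
Iterable<CSVRecord> records = CSVFormat.EXCEL.withHeader().parse(in);
for (CSVRecord record : records) {
    String lastName = record.get("Last Name");
    String firstName = record.get("First Name");
}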