HiveBrain v1.2.0
Get Started
← Back to all entries
patternjavaMinor

Writing CSV file from huge JSON data

Submitted by: @import:stackexchange-codereview··
0
Viewed 0 times
filecsvwritinghugejsonfromdata

Problem

I am writing a program that reads from DB and outputs to a CSV file. Besides the regular columnar data there are 2 JSON fields data as well. The table layout looks like this (other fields removed for brevity):

+----+--------------+-------------+-----------------------+
| ID | Product_Type | Json_Data | Demographic_Questions |
+----+--------------+-------------+-----------------------+
| 1 | DPI | {some_JSON} | {another_JSON} |
+----+--------------+-------------+-----------------------+
| 2 | Travel | {some_JSON} | {another_JSON} |
+----+--------------+-------------+-----------------------+


Program logic

  • Read data from DB



  • Store columnar data into a map



  • Convert JSON data into CSV format and store into the map



  • Write map into CSV file



The main program

`public static void extractData(String lastRunDateTime, String extractionType) throws Exception
{
List> flatJson = new ArrayList>();
String result = "";
ResultSet rsData = null;

List productType = new ArrayList<>(); // Store Product Type name for SQL & CSV creation

try {
conn = dbUtil.dbConnect();

String sqlQuery = "SELECT DISTINCT Product_Type FROM Mapping WHERE Extraction_Type = '"+ extractionType +"'";
st = conn.prepareStatement(sqlQuery);
rsData = st.executeQuery();
while(rsData.next()) {
productType.add(rsData.getString("Product_Type"));
}

// Currently there are 4 product types in DB
for(int i = 0; i map = new LinkedHashMap();

//LOG.debug("Sys_Policy_No = " + rsData.getString("Sys_Policy_No"));

map.put("ID", rsData.getString("ID"));
map.put("Product_type", rsData.getString("Product_type"));

// Read JSON data and convert to columns
result = rsData.getString("Json_Data");
if(result != null && result.length() != 0) {
addKeys("", new ObjectMapper().

Solution

JSON library

Assuming ObjectMapper is from the Jackson library, I think you should be able to create only one instance of it as it's safe to do so. Pro-tip: on that link, the developer of Jackson also suggests using ObjectReader/ObjectWriter if you are using Jackson 2.x.

try-with-resources

Since Java 7, you can use try-with-resources to safely and efficiently manage I/O resources, such as your JDBC-related resources. More specifically, you can take a look at this WebLogic blog article to better understand how you can use it for the Connection, Statement and ResultSet objects together.

Variable scope

Your flatJson List is declared quite early on, necessitating you to keep clear()-ing it for each iteration. You can instead considering creating a new List each time.

Modeling JSON as a domain object (...?)

This is just a thought, how about modeling the JSON as a domain object, so that you do less of addKeys() yourself, and perhaps just need a nice toCsvMap() implementation on the domain object to get the Map output you need?

Of course, this very much depends on what you mean 'growing by 30 data elements every other months' as... are these elements just part of an array that your JSON library can easily output as a List? Or do you really mean the JSON payload mutates in different ways even between rows, such that there's no one coherent structure to map it as an object?

SQL Server 2016?

If you are using SQL Server 2016, it looks like you may also rely on it to convert your JSON data to rows and columns... again, per disclaimer above, this depends on how its structure changes over time.

Optimizing bottlenecks

Last but not least, have you already tried profiling - regardless of using precise instrumentation frameworks, or just informally with a stopwatch - your application from the time it queries the database to the time the CSV file is generated? Can the database query be further optimized? Is there some inherent network latency somewhere that is making the code appear to work slower? Is writer.writeAsCSV(flatJson , filename) reasonably efficient? See below.

Writing output

Instead of doing a sub-optimal String concatenation using + in each iteration, consider using the newer Files.write(Path, Iterable, CharSet, OpenOption) method. You just need to map each LinkedHashMap element of your List to a String, and the method will iterate through them for you. In addition, it uses the OS-specific line separator, which may be more preferable depending on your use case.

Since you have 1300 columns, performing a Map.get(Object) twice to retrieve the value for each column, per row, is not going to be the fastest way to do so. Just hold on to that thought for a moment...

Think of the children consumers!

Actually, why is there this requirement to write such a 'sparse' CSV file, where there is never a complete row, and instead you are going to have blocks of values, and then blocks of emptiness depending on the product?

I suppose the output will resemble something the following, if we can sort the rows by product type and there are no other overlapping columns other than the ID and product type:

ID,Product_type,dpi_1,dpi_2,dpi_3,travel_1,travel_2,travel_3,other_1,other_2,other_3
1,dpi,a,b,c,,,,,,
2,dpi,d,e,f,,,,,,
3,dpi,g,h,i,,,,,,
4,travel,,,,j,k,l,,,
5,travel,,,,m,n,o,,,
6,travel,,,,p,q,r,,,
7,other,,,,,,,s,t,u
8,other,,,,,,,v,w,x
9,other,,,,,,,y,z,?


Will it not be better off to create one CSV file per product type, so that the consumers of these data can fully process the product-type-specific file they require, instead of having to cherry-pick columns from a 40 MB file, which will likely be slower as well?

Writing output (cont'd)

Resuming from the earlier section, Java 8 has a Map.getOrDefault(Object, V) method that simplifies your approach of calling Map.get(Object) twice:

// String value = map.get(header) == null ? "" : map.get(header).replace(",", "");
String value = map.getOrDefault(header, "").replace(",", "");


The conversion of a List to a List is relatively straightforward when you think of the approach as such:

-
Create a map of the total columns you have, with elements mapping to themselves, and treat this as the zeroth row, i.e. a single-element List>.

-
Create a stream out of the zeroth row and your actual payload (\$1...n\$ rows), so that you can apply the common step of mapping each column header against all the \$n + 1\$ Maps and concatenating them as a String.

Putting it altogether:

```
private static List flattenAll(List> input) {
Set columns = input.stream()
.flatMap(v -> v.keySet().stream())
.collect(Collectors.toCollection(LinkedHashSet::new));
Map header = columns.stream()
.collect(Collectors.toMap(k -> k, v -> v));
return Stream.concat(Stream.of(header), input.stream())

Code Snippets

ID,Product_type,dpi_1,dpi_2,dpi_3,travel_1,travel_2,travel_3,other_1,other_2,other_3
1,dpi,a,b,c,,,,,,
2,dpi,d,e,f,,,,,,
3,dpi,g,h,i,,,,,,
4,travel,,,,j,k,l,,,
5,travel,,,,m,n,o,,,
6,travel,,,,p,q,r,,,
7,other,,,,,,,s,t,u
8,other,,,,,,,v,w,x
9,other,,,,,,,y,z,?
// String value = map.get(header) == null ? "" : map.get(header).replace(",", "");
String value = map.getOrDefault(header, "").replace(",", "");
private static List<String> flattenAll(List<LinkedHashMap<String, String>> input) {
    Set<String> columns = input.stream()
                                .flatMap(v -> v.keySet().stream())
                                .collect(Collectors.toCollection(LinkedHashSet::new));
    Map<String, String> header = columns.stream()
                                        .collect(Collectors.toMap(k -> k, v -> v));
    return Stream.concat(Stream.of(header), input.stream())
                .map(m -> columns.stream()
                                .map(k -> m.getOrDefault(k, "").replace(",", ""))
                                .collect(Collectors.joining(",")))
                .collect(Collectors.toList());
}

Context

StackExchange Code Review Q#149178, answer score: 3

Revisions (0)

No revisions yet.