
Java Mapping eats superscripts!

Former Member
0 Kudos

Hi All,

This is in reference to my previous thread. I'm still trying to locate the missing superscripts and closing tags.

[XML not well formed|]

I suspect the tiny Java mapping we use is causing the problem. I would greatly appreciate expert comments and suggestions on the mapping (I'm not a Java guy).

public class SeparateFile implements StreamTransformation {

    String strXML = new String();
    AbstractTrace trace;
    private Map param = null;

    public void setParameter(Map param) {
        this.param = param;
    }

    public void execute(InputStream in, OutputStream out) {
        int strBegin, strEnd;
        String ns1String = new String();
        String ns2String = new String();
        String ns3String = new String();
        String headerString = new String();
        String outString = new String();
        String xmldecl = "<?xml version=\"1.0\" encoding=\"utf-8\"?>";
        //String StrXML = new String();
        String outString1 = " ";
        String BillStr[] = null;
        String[] BillStrMod = new String[2000];

        trace =
            (AbstractTrace) param.get(
                StreamTransformationConstants.MAPPING_TRACE);
        trace.addInfo("Process Started");
        String line = new String();

        try {
            StringBuffer strbuffer = new StringBuffer();
            byte[] b = new byte[4096];
            for (int n; (n = in.read(b)) != -1;) {
                strbuffer.append(new String(b, 0, n));
            }
            strXML = strbuffer.toString();
        } catch (Exception e) {
            System.out.println("Exception Occurred");
        }

        /************ NameSpace and the Root Element is Trimmed here
         * <ns1:MT_BillPrint xmlns:ns1=\"http://londonhydro.com/Matrix/BILL/BillPrint\">
         * and each invoice taken into array
         ************/
        strXML = strXML.substring(114, strXML.length());
        BillStr = strXML.split("</invextract>", -1);
        BillStr[0] = xmldecl.concat(BillStr[0]);

        /************ Append The Array Values with <!-- end of record -->
         * and write it to string buffer.
         ************/
        for (int cnt = 0; cnt < BillStr.length - 1; cnt++) {
            BillStrMod[cnt] =
                BillStr[cnt].concat("</invextract><!-- end of record -->");
            outString = outString.concat(BillStrMod[cnt]);
        }

        in = new ByteArrayInputStream(outString.getBytes());

        try {
            out.write(outString.getBytes());
        } catch (Exception e) {
            System.out.println("Exception in Writing to output Stream");
        }
    }
}

Thanks in advance!

Anish

Accepted Solutions (1)


stefan_grube
Active Contributor
0 Kudos

byte[] b = new byte[4096];
for (int n; (n = in.read(b)) != -1;) {
    strbuffer.append(new String(b, 0, n));
}
strXML = strbuffer.toString();

I think, this could be the issue.

UTF-8 is a variable-length encoding; some characters, like ³, are represented with two, three or four bytes. in.read() fills the buffer with up to 4096 bytes (not characters), so the byte representation of a ³ might be split across two reads.

When it is split, the first byte alone is not a valid UTF-8 character, and the second byte, taken together with the character that follows it (for example a "), cannot be interpreted either, so both characters disappear.

When you have 19969 occurrences of ³ and 5 of them are wrong, that is exactly what is expected based on probability, as each ³ character has a probability of 1/4096 of being split.

You have to change that piece of code.
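To make the effect visible, here is a minimal sketch that decodes a stream chunk by chunk, the way the mapping does. The class name SplitCharDemo, the method decodePerChunk and the chunkSize parameter are made up for illustration, and the sketch decodes explicitly as UTF-8 (the original code relies on the platform default charset) and uses StandardCharsets, which needs Java 7 or later. A tiny chunk size stands in for 4096 so that the split is guaranteed to happen.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class SplitCharDemo {

    // Decodes every chunk on its own, like the loop in the mapping.
    // chunkSize is a hypothetical stand-in for the 4096-byte buffer.
    static String decodePerChunk(byte[] data, int chunkSize) throws IOException {
        InputStream in = new ByteArrayInputStream(data);
        StringBuilder sb = new StringBuilder();
        byte[] b = new byte[chunkSize];
        for (int n; (n = in.read(b)) != -1;) {
            // Each chunk is decoded in isolation, so a character whose
            // bytes straddle two chunks can never be reassembled.
            sb.append(new String(b, 0, n, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String original = "m\u00B3"; // "m³": 1 byte for 'm', 2 bytes for '³'
        byte[] data = original.getBytes(StandardCharsets.UTF_8);

        // Chunk size 2 puts the boundary in the middle of '³'.
        String broken = decodePerChunk(data, 2);
        // Chunk size 3 reads the whole payload in one go.
        String intact = decodePerChunk(data, 3);

        System.out.println("split read preserved the text:  " + broken.equals(original));
        System.out.println("single read preserved the text: " + intact.equals(original));
    }
}
```

With the boundary inside the character, the ³ does not survive decoding; when the same bytes arrive in one chunk, it does.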

Regards

Stefan

Former Member
0 Kudos

Hi Stefan,

Great minds think alike

I solved this issue yesterday at noon, after a week of trial and error.

I solved it by increasing 4096 to 100000 and it works fine. I'm sure changing it to characters would be the best solution.
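A larger buffer only makes chunk boundaries rarer; it does not remove them. One way to read characters instead of bytes might look like the sketch below, where an InputStreamReader does the decoding and keeps track of incomplete byte sequences between reads. The class name ReadAsChars and the method readUtf8 are hypothetical, the payload is assumed to be UTF-8, and StandardCharsets needs Java 7 or later (on older JVMs the charset name "UTF-8" can be passed as a string instead).

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadAsChars {

    // Reads the whole stream as text. The reader decodes UTF-8 itself,
    // so a multi-byte character is never cut at a buffer boundary.
    static String readUtf8(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        char[] c = new char[4096]; // 4096 characters, not bytes
        for (int n; (n = reader.read(c)) != -1;) {
            sb.append(c, 0, n); // append only the characters actually read
        }
        return sb.toString();
    }
}
```

In the mapping, this would replace the byte loop, for example strXML = ReadAsChars.readUtf8(in);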

I'm sure you spent some time researching this, and I really appreciate it.

Thanks!

Anish

stefan_grube
Active Contributor
0 Kudos

> I solved it by increasing 4096 to 100000 and it works fine. I'm sure changing it to characters would be the best solution.

I found following code which might help you:

byte[] bbuf = new byte[in.available()];
    in.read(bbuf);

This need not be part of a loop, as the whole payload is copied to the byte array at once.

henrique_pinto
Active Contributor
0 Kudos

The in.available() method may work fine for most cases, but for huge XML files (which seems to be the case here, since he's talking about 14 MB files), the best approach is still a loop-based one, in order to avoid OOM (out of memory) issues.

The point is that the code is transforming the byte streams to strings before concatenating, and only then it concatenates the strings (which may have already been wrongly decoded). Then you may have the related issues.

A better solution would be to collect the bytes in a ByteArrayOutputStream (e.g. using the .write() method) and only at the end convert the result to a String (e.g. using something like

String str = new String(out.toByteArray(), "UTF-8");

).

Best,

Henrique.

Former Member
0 Kudos

Henrique,

Thanks a lot for your input. I was facing some memory issues (not to the extent of an error, though) with the in.available() approach. But I'm sure I will get there soon, as my file can go up to 40 MB.

With the approach you suggested, 14 MB file just zipped through with no missing superscripts.

This is how I did it.

byte[] b = new byte[4096];
ByteArrayOutputStream out = new ByteArrayOutputStream();
for (int n;(n = in.read(b)) != -1;) 
{
out.write(b);
}
StrXML = new String(out.toByteArray(), "UTF-8");

Should I look for some error with this code?

Thanks again!

Anish

henrique_pinto
Active Contributor
0 Kudos

Hi Anish,

you should be ok with that.

Best,

Henrique.

Former Member
0 Kudos

Hi Henrique , Stefan,

Thanks a lot for your support!

Regards!

Anish

henrique_pinto
Active Contributor
0 Kudos

One last comment on the matter:

> When you have 19969 occurrences of ³ and 5 of them are wrong, that is exactly what is expected based on probability, as each ³ character has a probability of 1/4096 of being split.

This is actually quite interesting.

Applying actual probability in a real life scenario.

However, if you think about it, the probability of a given N-byte word being split is not necessarily 1/4096 but rather (N-1)/4096 (*). The fact that 5/19969 =~ 1/4096 tells us that, for the ³ char, N = 2.

Fun.

Best,

Henrique.

(*) To better understand this, consider that the definition of "the word being split" is the same as "having any of the N bytes in the last position of the array, with the exception of the last byte of the word".
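The arithmetic can be checked with a few lines of Java. This is only a sketch under the same simplifying assumptions as the reasoning above: every read fills the 4096-byte buffer completely and the characters sit at uniformly distributed positions. The class and method names are invented for illustration.

```java
public class SplitProbability {

    // Probability that an nBytes-long character straddles a boundary
    // between two consecutive bufferSize-byte reads: (N-1)/bufferSize.
    static double splitProbability(int nBytes, int bufferSize) {
        return (nBytes - 1) / (double) bufferSize;
    }

    // Expected number of split characters among the given occurrences.
    static double expectedSplits(long occurrences, int nBytes, int bufferSize) {
        return occurrences * splitProbability(nBytes, bufferSize);
    }

    public static void main(String[] args) {
        // 19969 occurrences and a 4096-byte buffer, as in the thread.
        for (int n = 2; n <= 4; n++) {
            System.out.println("N=" + n + ": expected splits = "
                    + expectedSplits(19969, n, 4096));
        }
    }
}
```

For N = 2 this gives a little under 5 expected splits, while N = 3 and N = 4 would give roughly 10 and 15, so the 5 observed failures fit a two-byte character best.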

Former Member
0 Kudos

I changed the code, in case someone refers to this thread:

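// Collect the raw bytes first and decode once at the end, so that
// no multi-byte UTF-8 character is split between two reads.
byte[] b = new byte[4096];
// Named "buffer" so it does not clash with the "out" parameter of execute().
ByteArrayOutputStream buffer = new ByteArrayOutputStream();
for (int n; (n = in.read(b)) != -1;)
{
    // Write only the n bytes actually read, not the whole array.
    buffer.write(b, 0, n);
}
strXML = new String(buffer.toByteArray(), "UTF-8");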
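For anyone wondering why the three-argument write matters: read() may fill only part of the array, and write(b) copies the whole array regardless, including stale bytes left over from the previous read. The sketch below shows the difference on a 10-byte payload with a deliberately tiny 4-byte buffer; the class name WriteLengthDemo, the method copy and its useLength flag are hypothetical helpers for the demonstration only.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class WriteLengthDemo {

    // Copies the stream into a byte array. With useLength == false it
    // writes the full buffer on every pass, like the earlier version.
    static byte[] copy(InputStream in, int bufSize, boolean useLength) throws IOException {
        byte[] b = new byte[bufSize];
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        for (int n; (n = in.read(b)) != -1;) {
            if (useLength) {
                buffer.write(b, 0, n); // only the bytes just read
            } else {
                buffer.write(b);       // whole array, stale tail included
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};

        byte[] padded = copy(new ByteArrayInputStream(data), 4, false);
        byte[] exact = copy(new ByteArrayInputStream(data), 4, true);

        System.out.println("write(b) produced " + padded.length + " bytes");
        System.out.println("write(b, 0, n) produced " + exact.length + " bytes");
    }
}
```

The last read returns only 2 bytes, so the write(b) variant appends 2 leftover bytes from the previous pass, while write(b, 0, n) reproduces the payload exactly. In the mapping this would show up as junk appended to the end of the XML whenever the payload size is not a multiple of the buffer size.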

Thanks!

Anish

Answers (0)