I think your approach is quite reasonable.
I can imagine other strategies. For example, you could sort both files before comparing them (there are efficient implementations of file sort, and the Unix sort utility can sort files of several GB in minutes). Once they are sorted, you can compare the files sequentially, reading line by line.
But that is a rather complex way to go: you either need to run an external program (sort) or write a comparably efficient file sort in Java yourself, which is not an easy task. So, for the sake of simplicity, I think your chunked-read approach is very promising.
As for how to find a reasonable block size: first of all, "the bigger, the better" may not hold. I think the total running time will approach some constant limit asymptotically, so you may get close to that limit sooner than you think. You need to benchmark this.
Next, you can read lines into a buffer like this:
final List<String> lines = new ArrayList<>();
try {
    final List<String> block = new ArrayList<>(BLOCK_SIZE);
    for (int i = 0; i < BLOCK_SIZE; i++) {
        final String line = ...; // read line from file
        block.add(line);
    }
    lines.addAll(block);
} catch (OutOfMemoryError ooe) {
    // break
}
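So you read as many lines as you can. When the OutOfMemoryError is thrown, the partially filled block is discarded, which leaves roughly the last BLOCK_SIZE lines' worth of memory free. BLOCK_SIZE should be big enough for the rest of your program to run without an OOM.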
Answer from BegemoT on Stack Overflow
In an ideal world, you would be able to read in every line of file_2 into memory (probably using a fast lookup object like a HashSet, depending on your needs), then read in each line from file_1 one at a time and compare it to your data structure holding the lines from file_2.
As you have said you run out of memory however, I think a divide-and-conquer type strategy would be best. You could use the same method as I mentioned above, but read in a half (or a third, a quarter... depending on how much memory you can use) of the lines from file_2 and store them, then compare all of the lines in file_1. Then read in the next half/third/quarter/whatever into memory (replacing the old lines) and go through file_1 again. It means you have to go through file_1 more, but you have to work with your memory constraints.
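The chunked strategy above can be sketched as follows. This is only an illustration under assumptions: the class name ChunkedLineMatcher, the method commonLines and the chunkSize parameter are hypothetical, and "compare" is taken to mean "collect the lines of file_1 that also occur in file_2", which may not be what your real comparison does.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

class ChunkedLineMatcher {

    /**
     * Returns the lines of file1 that also occur in file2, holding at most
     * chunkSize distinct lines of file2 in memory at a time.
     */
    static Set<String> commonLines(Path file1, Path file2, int chunkSize) throws IOException {
        Set<String> matches = new TreeSet<>();
        try (BufferedReader in2 = Files.newBufferedReader(file2)) {
            Set<String> chunk = new HashSet<>();
            String line;
            while ((line = in2.readLine()) != null) {
                chunk.add(line);
                if (chunk.size() >= chunkSize) {
                    scan(file1, chunk, matches); // one full pass over file1 per chunk
                    chunk.clear();               // replace the old lines with the next chunk
                }
            }
            if (!chunk.isEmpty()) {
                scan(file1, chunk, matches);     // last, possibly smaller chunk
            }
        }
        return matches;
    }

    /** Reads file1 line by line and records every line found in the current chunk. */
    private static void scan(Path file1, Set<String> chunk, Set<String> matches) throws IOException {
        try (BufferedReader in1 = Files.newBufferedReader(file1)) {
            String line;
            while ((line = in1.readLine()) != null) {
                if (chunk.contains(line)) {
                    matches.add(line);
                }
            }
        }
    }
}
```

The smaller chunkSize is, the more passes over file_1 are needed, so the value is a trade-off between memory and running time.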
EDIT: In response to the added detail in your question, I would change my answer in part. Instead of reading in all of file_2 (or in chunks) and reading in file_1 a line at a time, reverse that, as file_1 holds the data to check against.
Also, with regard to searching for the matching lines: I think the best way would be to do some preprocessing on file_1. Create a HashMap<String, List<Range>> that maps a String ("mat1" - "mat50") to a list of Ranges (just a wrapper for a startOfRange int and an endOfRange int) and populate it with the data from file_1. Then write a function like this (ignoring error checking):
boolean isInRange(String material, int value)
{
    List<Range> ranges = hashMapName.get(material);
    for (Range range : ranges)
    {
        if (value >= range.getStart() && value <= range.getEnd())
        {
            return true;
        }
    }
    return false;
}
and call it for each (parsed) line of file_2.
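For completeness, here is a small self-contained sketch of how the map and the Range wrapper could fit together. The names RangeLookup and addRange are made up for illustration, and since the line format of file_1 is not shown, parsing is left out; you would call addRange once per parsed line of file_1.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RangeLookup {

    /** Simple wrapper for an inclusive [start, end] range. */
    static final class Range {
        private final int start;
        private final int end;

        Range(int start, int end) {
            this.start = start;
            this.end = end;
        }

        int getStart() { return start; }
        int getEnd() { return end; }
    }

    private final Map<String, List<Range>> rangesByMaterial = new HashMap<>();

    /** Registers one range for a material (hypothetical helper, one call per line of file_1). */
    void addRange(String material, int start, int end) {
        rangesByMaterial.computeIfAbsent(material, k -> new ArrayList<>())
                        .add(new Range(start, end));
    }

    /** Same check as isInRange above, plus a guard for unknown materials. */
    boolean isInRange(String material, int value) {
        List<Range> ranges = rangesByMaterial.get(material);
        if (ranges == null) {
            return false;
        }
        for (Range range : ranges) {
            if (value >= range.getStart() && value <= range.getEnd()) {
                return true;
            }
        }
        return false;
    }
}
```

With at most 50 materials the map stays tiny, so only the number of ranges per material affects lookup time.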
This is exactly what the FileUtils.contentEquals method of Apache Commons IO does.
Try something like:
File file1 = new File("file1.txt");
File file2 = new File("file2.txt");
boolean isTwoEqual = FileUtils.contentEquals(file1, file2);
It performs the following checks before actually comparing the contents:
- Both files exist.
- Both arguments are regular files, not directories.
- The lengths in bytes are the same.
- The two arguments are different files, not one and the same file.
- Only then are the contents compared.
If you don't want to use any external libraries, simply read the files into byte arrays and compare them with Arrays.equals (this won't work before Java 7):
byte[] f1 = Files.readAllBytes(file1.toPath());
byte[] f2 = Files.readAllBytes(file2.toPath());
boolean isTwoEqual = Arrays.equals(f1, f2);
If the files are large, then instead of reading the entire files into arrays, you should use BufferedInputStream and read the files chunk by chunk.
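For large files, a buffered comparison could look roughly like the sketch below. It is not the code the answer originally linked to; the class name StreamCompare and the method sameContent are invented for this example. BufferedInputStream does the chunked reading internally, so the loop can simply compare byte by byte.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

class StreamCompare {

    /** Returns true if both files contain exactly the same bytes. */
    static boolean sameContent(Path a, Path b) throws IOException {
        // Cheap check first: files of different length cannot be equal.
        if (Files.size(a) != Files.size(b)) {
            return false;
        }
        try (InputStream in1 = new BufferedInputStream(Files.newInputStream(a));
             InputStream in2 = new BufferedInputStream(Files.newInputStream(b))) {
            int byte1;
            while ((byte1 = in1.read()) != -1) {
                if (byte1 != in2.read()) {
                    return false; // first mismatch ends the comparison
                }
            }
            // in1 is exhausted; the files are equal only if in2 is exhausted too.
            return in2.read() == -1;
        }
    }
}
```

Only the two internal buffers are held in memory, regardless of file size.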
You need both files sorted by your search keys (recordIdx and topicIdx), so that you can do a kind of merge operation like this:
open file 1
open file 2
read lineA from file 1
read lineB from file 2
while (there is lineA and lineB)
    if (key lineB < key lineA)
        read lineB from file 2
        continue loop
    if (key lineB > key lineA)
        read lineA from file 1
        continue loop
    // at this point, you have lineA and lineB with matching keys
    process your data
    read lineB from file 2
Note that you'll only ever have two records in memory.
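In Java, the merge loop could be sketched like this. The record layout is an assumption: the hypothetical key method treats everything before the first comma as the key and compares keys as plain strings, so both inputs must be sorted in that same string order. Your real recordIdx/topicIdx keys will need their own extraction and comparison.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class SortedMergeJoin {

    /** Hypothetical key extraction: the text before the first comma. */
    static String key(String line) {
        int comma = line.indexOf(',');
        return comma < 0 ? line : line.substring(0, comma);
    }

    /** Pairs up lines with equal keys; both readers must deliver lines sorted by key. */
    static List<String> join(BufferedReader file1, BufferedReader file2) throws IOException {
        List<String> matches = new ArrayList<>();
        String lineA = file1.readLine();
        String lineB = file2.readLine();
        while (lineA != null && lineB != null) {
            int cmp = key(lineB).compareTo(key(lineA));
            if (cmp < 0) {
                lineB = file2.readLine();   // file 2 is behind, advance it
            } else if (cmp > 0) {
                lineA = file1.readLine();   // file 1 is behind, advance it
            } else {
                // Matching keys: "process your data" is just collecting the pair here.
                matches.add(lineA + " <-> " + lineB);
                lineB = file2.readLine();
            }
        }
        return matches;
    }
}
```

As in the pseudocode, only file 2 is advanced after a match, so several lines in file 2 may match the same line in file 1.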
If you really need this in Java, why not use java-diff-utils? It implements a well-known diff algorithm.
The code below will serve your purpose irrespective of the content of the files.
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
public class Test {
public Test(){
System.out.println("Test.Test()");
}
public static void main(String[] args) throws Exception {
BufferedReader br1 = null;
BufferedReader br2 = null;
String sCurrentLine;
List<String> list1 = new ArrayList<String>();
List<String> list2 = new ArrayList<String>();
br1 = new BufferedReader(new FileReader("test.txt"));
br2 = new BufferedReader(new FileReader("test2.txt"));
while ((sCurrentLine = br1.readLine()) != null) {
list1.add(sCurrentLine);
}
while ((sCurrentLine = br2.readLine()) != null) {
list2.add(sCurrentLine);
}
List<String> tmpList = new ArrayList<String>(list1);
tmpList.removeAll(list2);
System.out.println("content from test.txt which is not there in test2.txt");
for(int i=0;i<tmpList.size();i++){
System.out.println(tmpList.get(i)); //content from test.txt which is not there in test2.txt
}
System.out.println("content from test2.txt which is not there in test.txt");
tmpList = list2;
tmpList.removeAll(list1);
for(int i=0;i<tmpList.size();i++){
System.out.println(tmpList.get(i)); //content from test2.txt which is not there in test.txt
}
}
}
Memory will be a problem for large files, as both files have to be loaded into the program completely.
I am using a HashSet to ignore duplicates. Try this:
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashSet;
public class FileReader1 {
public static void main(String args[]) {
String filename = "abc.txt";
String filename2 = "xyz.txt";
HashSet <String> al = new HashSet<String>();
HashSet <String> al1 = new HashSet<String>();
HashSet <String> diff1 = new HashSet<String>();
HashSet <String> diff2 = new HashSet<String>();
String str = null;
String str2 = null;
try {
BufferedReader in = new BufferedReader(new FileReader(filename));
while ((str = in.readLine()) != null) {
al.add(str);
}
in.close();
} catch (Exception e) {
e.printStackTrace();
}
try {
BufferedReader in = new BufferedReader(new FileReader(filename2));
while ((str2 = in.readLine()) != null) {
al1.add(str2);
}
in.close();
} catch (Exception e) {
e.printStackTrace();
}
for (String str3 : al) {
if (!al1.contains(str3)) {
diff1.add(str3);
}
}
for (String str5 : al1) {
if (!al.contains(str5)) {
diff2.add(str5);
}
}
for (String str4 : diff1) {
System.out.println("Removed Path: "+str4);
}
for (String str4 : diff2) {
System.out.println("Added Path: "+str4);
}
}
}
Output:
Removed Path: E:\Users\Documents\hello\b.properties
Added Path: E:\Users\Documents\hello\h.properties
Added Path: E:\Users\Documents\hello\g.properties
HashMap solution
I thought about it, and the HashMap solution is practically instant. I went ahead and coded up an example of it here.
It ran in 0 ms, while the ArrayList version ran in 16 ms for the same data set.
public static void main(String[] args) throws Exception {
BufferedReader br1 = null;
BufferedReader br2 = null;
BufferedWriter bw3 = null;
String sCurrentLine;
int linelength;
HashMap<String, Integer> expectedrecords = new HashMap<String, Integer>();
HashMap<String, Integer> actualrecords = new HashMap<String, Integer>();
br1 = new BufferedReader(new FileReader("expected.txt"));
br2 = new BufferedReader(new FileReader("actual.txt"));
while ((sCurrentLine = br1.readLine()) != null) {
if (expectedrecords.containsKey(sCurrentLine)) {
expectedrecords.put(sCurrentLine, expectedrecords.get(sCurrentLine) + 1);
} else {
expectedrecords.put(sCurrentLine, 1);
}
}
while ((sCurrentLine = br2.readLine()) != null) {
if (expectedrecords.containsKey(sCurrentLine)) {
int expectedCount = expectedrecords.get(sCurrentLine) - 1;
if (expectedCount == 0) {
expectedrecords.remove(sCurrentLine);
} else {
expectedrecords.put(sCurrentLine, expectedCount);
}
} else {
if (actualrecords.containsKey(sCurrentLine)) {
actualrecords.put(sCurrentLine, actualrecords.get(sCurrentLine) + 1);
} else {
actualrecords.put(sCurrentLine, 1);
}
}
}
// expected is left with all records not present in actual
// actual is left with all records not present in expected
bw3 = new BufferedWriter(new FileWriter(new File("c.txt")));
bw3.write("Records which are not present in actual\n");
for (String key : expectedrecords.keySet()) {
for (int i = 0; i < expectedrecords.get(key); i++) {
bw3.write(key);
bw3.newLine();
}
}
bw3.write("Records which are in actual but not present in expected\n");
for (String key : actualrecords.keySet()) {
for (int i = 0; i < actualrecords.get(key); i++) {
bw3.write(key);
bw3.newLine();
}
}
bw3.flush();
bw3.close();
}
ex:
expected.txt
one
two
four
five
seven
eight
actual.txt
one
two
three
five
six
c.txt
Records which are not present in actual
four
seven
eight
Records which are in actual but not present in expected
three
six
ex 2:
expected.txt
one
two
four
five
seven
eight
duplicate
duplicate
duplicate
actual.txt
one
duplicate
two
three
five
six
c.txt
Records which are not present in actual
four
seven
eight
duplicate
duplicate
Records which are in actual but not present in expected
three
six
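In Java 8 you can use Collection.removeIf(Predicate<? super E>). Take a copy of list1 first, because otherwise the second call would check against a list1 that has already lost the common lines:
List<String> originalList1 = new ArrayList<>(list1);
list1.removeIf(line -> list2.contains(line));
list2.removeIf(line -> originalList1.contains(line));
list1 will then contain everything that is NOT in list2, and list2 will contain everything that is NOT in list1.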
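One thing to keep in mind: List.contains scans the whole list on every call, so for large files it may be worth building HashSet snapshots first. A possible sketch (the class and method names are invented for this example):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ListDifference {

    /** Removes from each list the lines that also occur in the other list. */
    static void keepOnlyDifferences(List<String> list1, List<String> list2) {
        // Snapshots taken before any removal, so both checks see the original contents.
        Set<String> inList1 = new HashSet<>(list1);
        Set<String> inList2 = new HashSet<>(list2);
        list1.removeIf(inList2::contains);
        list2.removeIf(inList1::contains);
    }
}
```

The lists must be mutable (for example ArrayList); both sets are built up front, which costs extra memory in exchange for constant-time lookups.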
Assuming both files are sorted, you only need to read from the file whose current line is smaller (in terms of compareTo). If both current lines are equal, read the next line from both files; if one is bigger than the other, read only from the file with the smaller line. Whenever you read from only one of the files, you have found a difference: all lines read between two switches (from reading only file 1 to reading file 2 or both, or from reading only file 2 to reading file 1 or both) are lines that differ.
A sample to make this clearer. This is the case where you switch from reading file 1 to reading file 2:
if(line1.compareTo(line2)>0){
    if(lastRead==1) {
        System.out.println(previousLines+ " found in "+path1 +" but not in "+ path2);
        previousLines.clear();
    }
    previousLines.add(line2);
    line2=in2.readLine();
    lastRead = 2;
}
If line1 is bigger than line2 (line1 being the current line from file 1, line2 the current line from file 2), I will next read only from the second file. And if before that I had read only from file 1 (not from both at the same time, and not from the second one), all lines in previousLines should be printed. I add lines to previousLines when they differ. lastRead keeps track of the file I last read from (0 = both at the same time, 1 = only the first, 2 = only the second).
Late edit: here is the whole method body. As I mentioned in the comment, it does not handle the case where one file is exhausted before the other. As it stands, it works correctly if the last line is the same in both files. You can add further checks for readLine returning null for one file or the other.
void toTitleCase(Path path1, Path path2) {
try(BufferedReader in1= Files.newBufferedReader(path1);
BufferedReader in2= Files.newBufferedReader(path2)) {
String line1=in1.readLine(),line2=in2.readLine();
int lastRead=0;
List<String> previousLines=new ArrayList<>();
while(line1!=null && line2!=null){
if(line1.compareTo(line2)>0){
if(lastRead==1) {
System.out.println(previousLines+ " found in "+path1 +" but not in "+ path2);
previousLines.clear();
}
previousLines.add(line2);
line2=in2.readLine();
lastRead = 2;
} else if(line1.compareTo(line2)<0){
if(lastRead==2) {
System.out.println(previousLines+ " found in "+path2 +" but not in "+ path1);
previousLines.clear();
}
previousLines.add(line1);
line1=in1.readLine();
lastRead = 1;
} else{
if(lastRead==2) {
System.out.println(previousLines+ " found in "+path2 +" but not in "+ path1);
}
if(lastRead==1) {
System.out.println(previousLines+ " found in "+path1 +" but not in "+ path2);
}
previousLines.clear();
line1=in1.readLine();
line2=in2.readLine();
lastRead=0;
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
I thought this might be an interesting problem, so I put something together to illustrate how a difference application might work.
I had a file of words for a different application. So, I grabbed the first 100 words and reduced the size of each down to something I could test with easily.
Word List 1
aback
abandon
abandoned
abashed
abatement
abbey
abbot
abbreviate
abdomen
abducted
aberrant
aberration
abetted
abeyance
Word List 2
aardvark
aback
abacus
abandon
abatement
abbey
abbot
abbreviate
abdicate
abdomen
aberrant
aberration
My example application produces two different outputs. Here's the first output from my test run, the full difference output.
Differences between /word1.txt and /word2.txt
-----------------------------------------------------
------ Inserted ----- | aardvark
aback | aback
------ Inserted ----- | abacus
abandon | abandon
abandoned | ------ Deleted ------
abashed | ------ Deleted ------
abatement | abatement
abbey | abbey
abbot | abbot
abbreviate | abbreviate
------ Inserted ----- | abdicate
abdomen | abdomen
abducted | ------ Deleted ------
aberrant | aberrant
aberration | aberration
abetted | ------ Deleted ------
abeyance | ------ Deleted ------
Now, for two really long files, where most of the text will match, this output would be hard to read. So, I also created an abbreviated output.
Differences between /word1.txt and /word2.txt
-----------------------------------------------------
------ Inserted ----- | aardvark
--------------- 1 line is the same --------------
------ Inserted ----- | abacus
--------------- 1 line is the same --------------
abandoned | ------ Deleted ------
abashed | ------ Deleted ------
-------------- 4 lines are the same -------------
------ Inserted ----- | abdicate
--------------- 1 line is the same --------------
abducted | ------ Deleted ------
-------------- 2 lines are the same -------------
abetted | ------ Deleted ------
abeyance | ------ Deleted ------
With these small test files, there's not much difference between the two reports.
With two large text files, the abbreviated report would be a lot easier to read.
Here's the example code.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
public class Difference {
public static void main(String[] args) {
String file1 = "/word1.txt";
String file2 = "/word2.txt";
try {
new Difference().compareFiles(file1, file2);
} catch (IOException e) {
e.printStackTrace();
}
}
private void compareFiles(String file1, String file2)
throws IOException {
int columnWidth = 25;
int pageWidth = columnWidth + columnWidth + 3;
boolean isFullReport = true;
System.out.println(getTitle(file1, file2));
System.out.println(getDashedLine(pageWidth));
System.out.println();
URL url1 = getClass().getResource(file1);
URL url2 = getClass().getResource(file2);
BufferedReader br1 = new BufferedReader(new InputStreamReader(
url1.openStream()));
BufferedReader br2 = new BufferedReader(new InputStreamReader(
url2.openStream()));
int countEqual = 0;
String line1 = br1.readLine();
String line2 = br2.readLine();
while (line1 != null && line2 != null) {
int result = line1.compareTo(line2);
if (result == 0) {
countEqual++;
if (isFullReport) {
System.out.println(getFullEqualsLine(columnWidth,
line1, line2));
}
line1 = br1.readLine();
line2 = br2.readLine();
} else if (result < 0) {
printEqualsLine(pageWidth, countEqual, isFullReport);
countEqual = 0;
System.out.println(getDifferenceLine(columnWidth,
line1, ""));
line1 = br1.readLine();
} else {
printEqualsLine(pageWidth, countEqual, isFullReport);
countEqual = 0;
System.out.println(getDifferenceLine(columnWidth,
"", line2));
line2 = br2.readLine();
}
}
printEqualsLine(pageWidth, countEqual, isFullReport);
while (line1 != null) {
System.out.println(getDifferenceLine(columnWidth,
line1, ""));
line1 = br1.readLine();
}
while (line2 != null) {
System.out.println(getDifferenceLine(columnWidth,
"", line2));
line2 = br2.readLine();
}
br1.close();
br2.close();
}
private void printEqualsLine(int pageWidth, int countEqual,
boolean isFullReport) {
if (!isFullReport && countEqual > 0) {
System.out.println(getEqualsLine(countEqual, pageWidth));
}
}
private String getTitle(String file1, String file2) {
return "Differences between " + file1 + " and " + file2;
}
private String getEqualsLine(int count, int length) {
String lines = "lines are";
if (count == 1) {
lines = "line is";
}
String output = " " + count + " " + lines +
" the same ";
return getTextLine(length, output);
}
private String getFullEqualsLine(int columnWidth, String line1,
String line2) {
String format = "%-" + columnWidth + "s";
return String.format(format, line1) + " | " +
String.format(format, line2);
}
private String getDifferenceLine(int columnWidth, String line1,
String line2) {
String format = "%-" + columnWidth + "s";
String deleted = getTextLine(columnWidth, " Deleted ");
String inserted = getTextLine(columnWidth, " Inserted ");
if (line1.isEmpty()) {
return inserted + " | " + String.format(format, line2);
} else {
return String.format(format, line1) + " | " + deleted;
}
}
private String getTextLine(int length, String output) {
int half2 = (length - output.length()) / 2;
int half1 = length - output.length() - half2;
output = getDashedLine(half1) + output;
output += getDashedLine(half2);
return output;
}
private String getDashedLine(int count) {
String output = "";
for (int i = 0; i < count; i++) {
output += "-";
}
return output;
}
}