Hadoop Big Data: Combiner/Serialization/Sorting in MapReduce

  • Combiner in MapReduce

(1) The Combiner is a component of an MR program in addition to the Mapper and Reducer.
(2) The parent class of the Combiner component is Reducer.
(3) The difference between the Combiner and the Reducer lies in where they run:
the Combiner runs on every MapTask node;
the Reducer receives the global output of all Mappers (a sketch of a Combiner follows below).
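
A minimal sketch of such a Combiner, assuming the classic word-count job in which the Mapper emits <word, 1> pairs (the class name WordCountCombiner is illustrative, not from the original post):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

//A Combiner is simply a Reducer subclass that runs on each MapTask node,
//pre-summing the local <word, 1> pairs before the shuffle.
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
	private final IntWritable sum = new IntWritable();

	@Override
	protected void reduce(Text key, Iterable<IntWritable> values, Context context)
			throws IOException, InterruptedException {
		int total = 0;
		for (IntWritable v : values) {
			total += v.get();
		}
		sum.set(total);
		context.write(key, sum);
	}
}

It is enabled on the driver side with job.setCombinerClass(WordCountCombiner.class). Because the framework may run the Combiner zero or more times, its input and output key/value types must match the Mapper's output types.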

  • Serialization in MapReduce

(1) Java serialization (Serializable) is a heavyweight serialization framework. When an object is serialized, it carries a lot of extra information (checksums, headers, the class inheritance hierarchy, ...), so the result is bulky and inefficient to transmit over the network.
Hadoop therefore developed its own Writable mechanism, which is compact and efficient.
The simple code below demonstrates the difference between the two serialization mechanisms:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.ObjectOutputStream;

import org.apache.hadoop.io.Text;

public class TestSeri {
	public static void main(String[] args) throws Exception {
		//Define two ByteArrayOutputStreams to receive the results of the two serialization mechanisms
		ByteArrayOutputStream ba = new ByteArrayOutputStream();
		ByteArrayOutputStream ba2 = new ByteArrayOutputStream();

		//Define the output streams: dout for Hadoop Writable serialization, obout (wrapping dout2) for JDK standard serialization
		DataOutputStream dout = new DataOutputStream(ba);
		DataOutputStream dout2 = new DataOutputStream(ba2);
		ObjectOutputStream obout = new ObjectOutputStream(dout2);
		//Define two beans as the source objects to serialize
		ItemBeanSer itemBeanSer = new ItemBeanSer(1000L, 89.9f);
		ItemBean itemBean = new ItemBean(1000L, 89.9f);

		//Serialization differences between String and Text types
		Text atext = new Text("a");
		// atext.write(dout);
		itemBean.write(dout);

		byte[] byteArray = ba.toByteArray();

		//Compare serialization results
		System.out.println(byteArray.length);
		for (byte b : byteArray) {

			System.out.print(b);
			System.out.print(":");
		}

		System.out.println("-----------------------");

		String astr = "a";
		// dout2.writeUTF(astr);
		obout.writeObject(itemBeanSer);

		byte[] byteArray2 = ba2.toByteArray();
		System.out.println(byteArray2.length);
		for (byte b : byteArray2) {
			System.out.print(b);
			System.out.print(":");
		}
	}
}
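
The two beans are not shown in the listing above. A plausible minimal definition, assuming (from the constructor calls) that each carries a long id and a float amount; the field names are illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.io.Serializable;

import org.apache.hadoop.io.Writable;

//Hadoop-serializable bean: write()/readFields() emit only the raw field bytes.
class ItemBean implements Writable {
	private long id;
	private float amount;

	public ItemBean() {} //no-arg constructor required by the Writable mechanism

	public ItemBean(long id, float amount) {
		this.id = id;
		this.amount = amount;
	}

	public void write(DataOutput out) throws IOException {
		out.writeLong(id);
		out.writeFloat(amount);
	}

	public void readFields(DataInput in) throws IOException {
		this.id = in.readLong();
		this.amount = in.readFloat();
	}
}

//JDK-serializable bean: ObjectOutputStream also writes a stream header and class metadata,
//which is why its byte output is much larger than the Writable output.
class ItemBeanSer implements Serializable {
	private static final long serialVersionUID = 1L;
	private long id;
	private float amount;

	public ItemBeanSer(long id, float amount) {
		this.id = id;
		this.amount = amount;
	}
}

With these definitions, the Writable path writes 12 bytes (8 for the long plus 4 for the float), while the ObjectOutputStream path adds a stream header and a full class descriptor on top of the field values, which is the size difference the program prints.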
  • Preliminary sorting in MapReduce

The MR framework sorts the data in the course of processing it. The sort order is based on the key of the mapper output.
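
Concretely, the Mapper's output key class must implement WritableComparable, and its compareTo method defines the order in which records reach the Reducer. A minimal sketch of such a custom key (AmountKey and its descending-by-amount order are illustrative assumptions, not part of the original post):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

//Custom map-output key: the framework sorts map output by calling compareTo on the key.
public class AmountKey implements WritableComparable<AmountKey> {
	private float amount;

	public AmountKey() {} //no-arg constructor required for deserialization

	public AmountKey(float amount) {
		this.amount = amount;
	}

	public void write(DataOutput out) throws IOException {
		out.writeFloat(amount);
	}

	public void readFields(DataInput in) throws IOException {
		this.amount = in.readFloat();
	}

	public int compareTo(AmountKey other) {
		//Descending order by amount
		return Float.compare(other.amount, this.amount);
	}
}

Built-in key types such as Text and IntWritable already implement WritableComparable, which is why plain word-count output arrives at the Reducer sorted by key.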


Added by northcave on Mon, 30 Sep 2019 23:29:29 +0300