MapReduce Data Processing with Hadoop and Spark

What is MapReduce?

MapReduce is a programming framework for processing large data sets in parallel across the machines of a distributed cluster.

  • MapReduce consists of two distinct tasks: Map and Reduce.
  • As the name MapReduce suggests, the reducer phase takes place after the mapper phase has completed.
  • The first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate output.
  • The output of a mapper or map job (key-value pairs) is the input to the reducer.
  • The reducer receives key-value pairs from multiple map jobs.
  • The reducer then aggregates those intermediate key-value pairs into a smaller set of key-value pairs, which is the final output.
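
The flow described above can be sketched in plain Java without any Hadoop dependency. This is only an in-memory illustration under stated assumptions: the class name MapReduceSketch, its methods, and the sample lines are hypothetical, and the explicit shuffle step stands in for the grouping by key that Hadoop performs between the map and reduce phases.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration of the map -> shuffle -> reduce flow (no Hadoop involved).
class MapReduceSketch {

    // Map phase: turn one line of input into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(Map.entry(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all intermediate values that share a key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: collapse the values of one key into a single count.
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Runs the three phases over all input lines and returns word -> count.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> result = new TreeMap<>();
        shuffle(intermediate).forEach((word, ones) -> result.put(word, reduce(ones)));
        return result;
    }

    public static void main(String[] args) {
        // prints {bear=2, car=3, deer=2, river=2}
        System.out.println(run(List.of("deer bear river", "car car river", "deer car bear")));
    }
}
```

In a real cluster each call to map would run on a different block of the input, and the reducers would receive their groups over the network rather than from a local TreeMap.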

A Word Count Example of MapReduce

  • We generated a text file (demo.txt) containing random string data using a script.
  • Start Hadoop on the master node by running the start scripts in the sbin directory of the Hadoop installation (for example start-dfs.sh and start-yarn.sh).
  • Copy the file into HDFS with the put command (hdfs dfs -put).
  • To run a Hadoop program we need a JAR file to execute, so we create one that contains the Java code for the word count.
  • Open Eclipse -> create a Java project -> Add External JARs (in the Hadoop directory, go to the share folder -> hadoop, and add the JARs from the hdfs and common folders).
  • Write the code for the Mapper and Reducer classes.
  • Compile the project and check for errors. If there are none, export it as a JAR file.
  • Open a terminal and run the MapReduce job with a command of the form “hadoop jar jar_file_path hdfs_input_file_path hdfs_output_path” (add the main class name after the JAR path if the JAR's manifest does not specify one). The HDFS output path is where you want the result to be saved.

We can then see the result in the HDFS output directory.

Now let's perform the same word count with Spark.

  • Install Spark, open a terminal, and run “spark-shell”.
  • Enter the following commands in the shell.
  • We create a variable, text, that holds the text file read from HDFS.
  • We split the text on spaces.
  • We perform the map operation.
  • We perform the reduce operation.
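
The shape of this pipeline (split, map, reduce by key) can be illustrated with a plain-Java analogue built on java.util.stream. This is not Spark code and not the commands typed into spark-shell; the class name StreamWordCount and the sample input are hypothetical, and the steps are only meant to correspond roughly to the flatMap, map, and reduceByKey operations that a typical RDD word count chains together.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogue of a split -> map -> reduce-by-key word count (no Spark involved).
class StreamWordCount {

    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
                // Split each line on spaces and flatten into one stream of words.
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // Consecutive spaces produce empty strings; drop them.
                .filter(word -> !word.isEmpty())
                // Map each word to a (word, 1) pair.
                .map(word -> Map.entry(word, 1))
                // Reduce by key: sum the 1s of all pairs that share a word.
                .collect(Collectors.toMap(
                        Map.Entry::getKey,
                        Map.Entry::getValue,
                        Integer::sum,
                        TreeMap::new));
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(count(List.of("to be or not to be")));
    }
}
```

The whole computation here happens in the memory of one JVM, whereas Spark distributes the same kind of chain across the partitions of an RDD.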

Spark evaluates lazily: when we apply transformations to an RDD, they are only recorded, not executed. The work actually runs when we perform an action.
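
Java streams behave in a similar way, which makes the idea easy to observe without a cluster. In the sketch below (an analogy, not Spark itself; the class LazyDemo and its counter are hypothetical), map plays the role of a transformation and collect plays the role of an action.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Demonstrates lazy evaluation using Java streams as a stand-in for RDD transformations.
class LazyDemo {

    // Counts how many times the mapping function has actually been executed.
    static final AtomicInteger calls = new AtomicInteger();

    // "Transformation": describes the work but does not run it yet.
    static Stream<String> buildPipeline(List<String> words) {
        return words.stream().map(w -> {
            calls.incrementAndGet();
            return w.toUpperCase();
        });
    }

    public static void main(String[] args) {
        Stream<String> pipeline = buildPipeline(List.of("spark", "hadoop"));
        // Nothing has been mapped yet: prints "after transformation: 0"
        System.out.println("after transformation: " + calls.get());

        // "Action": collecting forces the pipeline to run.
        List<String> out = pipeline.collect(Collectors.toList());
        // prints "after action: 2 [SPARK, HADOOP]"
        System.out.println("after action: " + calls.get() + " " + out);
    }
}
```

Deferring the work this way lets the engine see the whole chain of transformations before running anything, which is one reason lazy evaluation is useful in Spark.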

Observation

Spark was much faster than Hadoop in this test. Hadoop took some time to process the MapReduce job, whereas Spark returned the result within seconds.

Conclusion

  • We can conclude that Spark is faster than Hadoop for this job, but the question is why.
  • Spark is faster because it keeps intermediate data in random access memory (RAM) instead of reading and writing it to disk. Hadoop stores data across multiple nodes and processes it in batches via MapReduce, writing intermediate results to disk between stages.