Map-Reduce Data Processing with Hadoop and Spark
What is MapReduce?
MapReduce is a programming framework that allows us to perform distributed, parallel processing on large data sets across a cluster of machines.
- MapReduce consists of two distinct tasks — Map and Reduce.
- As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
- First comes the map job, where a block of data is read and processed to produce key-value pairs as intermediate output.
- The output of a Mapper or map job (key-value pairs) is input to the Reducer.
- The reducer receives the key-value pairs from multiple map jobs.
- Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which forms the final output.
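For example, given the input line "deer bear river deer deer", the mappers emit (deer, 1), (bear, 1), (river, 1), (deer, 1), (deer, 1); after the shuffle the reducer receives (deer, [1, 1, 1]), (bear, [1]), (river, [1]) and produces the final output (deer, 3), (bear, 1), (river, 1).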
A Word Count Example of MapReduce
- We have generated a text file (demo.txt) with a script that fills it with some random string data; a sketch of such a generator is shown below.
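- The exact contents do not matter; a minimal sketch of such a generator (assuming a bash shell; the real script may differ) is:
for i in $(seq 1 100000); do echo "hadoop spark mapreduce hdfs yarn"; done > /home/auriga/demo.txt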
- Now we start Hadoop on the master node with the following command from Hadoop's sbin directory:
start-all.sh
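- start-all.sh launches both the HDFS and YARN daemons; we can verify that they are running with the standard JDK tool:
jps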
- Now we put our file onto the HDFS file system with:
hadoop fs -put '/home/auriga/demo.txt' /auriga/inputdir/
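- Note: if the target directory does not already exist on HDFS, create it first with:
hadoop fs -mkdir -p /auriga/inputdir/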
- To run a Hadoop program we need a jar file to execute, so we create a jar file that contains our Java word count program.
- Open Eclipse -> create a Java project -> add the external jar files (go to the Hadoop installation directory -> share -> hadoop, and add the jars from the hdfs and common folders).
- Then write the code for the Mapper and Reducer classes:
package hadoop_map_reduce;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in the input line
    public static class MapClass extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokenizer = new StringTokenizer(value.toString());
            while (tokenizer.hasMoreTokens()) {
                String token = tokenizer.nextToken();
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // Reducer: sums the counts for each word
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable count = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            count.set(sum);
            context.write(key, count);
        }
    }

    public static void main(String[] arg) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setJobName("wordcount");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(MapClass.class);
        job.setReducerClass(Reduce.class);
        FileInputFormat.addInputPath(job, new Path(arg[0]));
        FileOutputFormat.setOutputPath(job, new Path(arg[1]));
        // run the job once and exit with 0 on success, 1 on failure
        boolean success = job.waitForCompletion(true);
        System.exit(success ? 0 : 1);
    }
}
- Now compile it and check whether any errors appear. If not, export it as a jar file (or build the jar yourself).
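- Alternatively, if Eclipse is not available, the class can be compiled and packaged from the command line; a rough sketch (assuming the source file sits in a hadoop_map_reduce directory matching its package) is:
mkdir -p classes
javac -classpath $(hadoop classpath) -d classes hadoop_map_reduce/WordCount.java
jar cfe hadoopWordCount.jar hadoop_map_reduce.WordCount -C classes .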
- Now open a terminal and run the MapReduce job with:
hadoop jar '/home/auriga/hadoopWordCount.jar' /auriga/inputdir/demo.txt /auriga/inputdir/outputdir14/
- The general form is "hadoop jar <jar_path> <hdfs_input_file_path> <hdfs_output_path>", where the HDFS output path is where you want the result to be saved.
We can then see the result in the HDFS output directory.
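For example, the word counts can be printed straight from HDFS (assuming the default output file name part-r-00000):
hadoop fs -cat /auriga/inputdir/outputdir14/part-r-00000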
Now let's perform the same word count with Spark.
- Install Spark, open a terminal, and run "spark-shell".
- Now enter the following command in the shell:
val text=sc.textFile("hdfs://192.168.43.155:9000/auriga/inputdir/demo.txt");
- This creates a variable text holding the text file read from the HDFS system.
val count=text.flatMap(line=>line.split(" "));
- Now we split each line of the text file on spaces.
val map=count.map(word=>(word,1));
- We perform the map operation, pairing each word with a count of 1.
val reduce=map.reduceByKey(_+_);
- We perform the reduce operation, summing the counts for each word.
reduce.collect
As we all know, Spark evaluates lazily: transformations on an RDD (flatMap, map, and reduceByKey above) are only recorded, and the computation actually runs when an action such as collect is performed.
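Instead of collecting the result to the driver, it can also be written back to HDFS; for example (the output path here is only illustrative):
reduce.saveAsTextFile("hdfs://192.168.43.155:9000/auriga/inputdir/spark_wordcount_output")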
Observation
Spark is much faster than Hadoop here. Hadoop takes noticeable time to process the MapReduce job, while Spark returns the result within seconds.
Conclusion
- We can conclude that Spark is faster than Hadoop, but the question is why.
- Spark is faster because it keeps intermediate data in random access memory (RAM) instead of reading and writing it to disk, whereas Hadoop stores data on disk across multiple nodes and processes it in batches via MapReduce.