Today’s Links

Posted in: Big Data Technologies - Sep 16, 2015
  • Apache Hadoop Map Reduce – Advanced Example

    Map Reduce Example

    The latest, updated version of this work is at http://blog.hampisoftware.com/?p=20

    Solve the same problem using Apache Spark: http://blog.hampisoftware.com/?p=41

    Use the links above; the material below is DATED.

    Continuing with my experiments, I attempted the Patent Citation example from the book “Hadoop In Action” by Chuck Lam.


    Data Set

    Visit http://nber.org/patents/
    Download acite75_99.zip, which unzips to cite75_99.txt.

    Each line of this file lists a patent number along with a patent that it cites. In this map reduce example, we are going to compute the reverse:

    for a given patent, which patents cite it (and how many citations exist).
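
    To make the inversion concrete, here is a tiny made-up illustration (the patent numbers below are invented for this example, not taken from the actual file). The input pairs a citing patent with the patent it cites; the job groups lines by the cited patent:

    ======================
    citing,cited
    4000001,3900001
    4000002,3900001
    4000002,3900007

    becomes

    3900001    4000001,4000002
    3900007    4000002
    ======================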

    What do you need?

    Download Hadoop v1.0.3 from http://hadoop.apache.org
    I downloaded hadoop-1.0.3.tar.gz (60MB)
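
    A minimal setup sketch, assuming you simply unpack the tarball and run in Hadoop's local (standalone) mode, which needs no configuration beyond a working JDK; the paths are whatever you chose:

    ==================
    tar xzf hadoop-1.0.3.tar.gz
    export HADOOP_HOME=$PWD/hadoop-1.0.3
    export PATH=$HADOOP_HOME/bin:$PATH
    ==================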


    Map Reduce Program

    Note: the program below is broken. The commentary that follows it explains why, and a working program appears at the end.

    ==================
    package mapreduce;

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class PatentCitation extends Configured implements Tool{

        public static class PatentCitationMapper extends Mapper<Text,Text,Text,Text> {
            protected void map(Text key, Text value, Context context)
                    throws IOException, InterruptedException {
                context.write(value, key);
            }
        }
       
        public static class PatentCitationReducer extends Reducer<Text,Text,Text,Text>{
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                String csv = "";
                Iterator<Text> iterator = values.iterator();
                while(iterator.hasNext()){
                    if(csv.length() > 0 ) csv += ",";
                    csv += iterator.next().toString();
                }
                context.write(key, new Text(csv));
            }
        }
       
        private  void deleteFilesInDirectory(File f) throws IOException {
            if (f.isDirectory()) {
                for (File c : f.listFiles())
                    deleteFilesInDirectory(c);
            }
            if (!f.delete())
                throw new FileNotFoundException("Failed to delete file: " + f);
        }
       
        @Override
        public int run(String[] args) throws Exception {
            if(args.length == 0)
                throw new IllegalArgumentException("Please provide input and output paths");
           
            Path inputPath = new Path(args[0]);
            File outputDir = new File(args[1]);
            deleteFilesInDirectory(outputDir);
            Path outputPath = new Path(args[1]);

            Job job = new Job(getConf(), "Hadoop Patent Citation Example");
            job.setJarByClass(PatentCitation.class);

            FileInputFormat.setInputPaths(job, inputPath);
            FileOutputFormat.setOutputPath(job, outputPath);

            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
           
            job.setMapperClass(PatentCitationMapper.class);
            job.setReducerClass(PatentCitationReducer.class);
           
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            return job.waitForCompletion(false) ? 0 : -1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new PatentCitation(), args));
        }
    }


    I created a directory called "input" where cite75_99.txt was placed. I also created an empty directory called "output". These directories serve as the input and output directories for the M/R program.


    Execution

    First Attempt:

    I executed the program as is. It choked: the job's temporary files filled up /tmp and exhausted the disk space on my root filesystem.

    Second Attempt:

    Now I explicitly configure the Hadoop tmp directory so I can place the temporary files wherever I want:
    -Dhadoop.tmp.dir=/home/anil/judcon12/tmp
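
    For reference, this is roughly how such a property gets passed on the command line. Because main() goes through ToolRunner, the GenericOptionsParser picks up -D options that appear before the positional arguments; the jar name here is just a placeholder for however you package the class (when running from an IDE, pass the same strings as program arguments):

    ==================
    hadoop jar patentcitation.jar mapreduce.PatentCitation \
        -Dhadoop.tmp.dir=/home/anil/judcon12/tmp input output
    ==================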

    Ok, now the program did not choke. It just ran and ran for 2+ hours, so I killed it. The data set seems too large to finish in a short duration.
    The culprit was the reduce phase. It just did not finish.

    Third Attempt:

    Now I tried to set the number of reducers to zero so I could view the output of the map phase.

    I tried the property -Dmapred.reduce.tasks=0.  It made no difference.

    I then added the following deprecated usage of the Job class. That worked.

            job.setNumReduceTasks(0);

    Ok, now the program just undertook the map phase.

    =======================
    May 25, 2012 4:40:42 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
    WARNING: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    May 25, 2012 4:40:42 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
    WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    May 25, 2012 4:40:42 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 1
    May 25, 2012 4:40:42 PM org.apache.hadoop.io.compress.snappy.LoadSnappy <clinit>
    WARNING: Snappy native library not loaded
    May 25, 2012 4:40:42 PM org.apache.hadoop.util.ProcessTree isSetsidSupported
    INFO: setsid exited with exit code 0
    May 25, 2012 4:40:42 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@420f9c40
    May 25, 2012 4:40:44 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
    May 25, 2012 4:40:44 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:44 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000000_0 is allowed to commit now
    May 25, 2012 4:40:44 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_m_000000_0′ to /home/anil/judcon12/output
    May 25, 2012 4:40:45 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:45 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000000_0′ done.
    May 25, 2012 4:40:45 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@527736bd
    May 25, 2012 4:40:46 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
    May 25, 2012 4:40:46 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:46 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000001_0 is allowed to commit now
    May 25, 2012 4:40:46 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_m_000001_0′ to /home/anil/judcon12/output
    May 25, 2012 4:40:48 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:48 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000001_0′ done.
    May 25, 2012 4:40:48 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@5749b290
    May 25, 2012 4:40:49 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000002_0 is done. And is in the process of commiting
    May 25, 2012 4:40:49 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:49 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000002_0 is allowed to commit now
    May 25, 2012 4:40:49 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_m_000002_0′ to /home/anil/judcon12/output
    May 25, 2012 4:40:51 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:51 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000002_0′ done.
    May 25, 2012 4:40:51 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2a8ceeea
    May 25, 2012 4:40:52 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000003_0 is done. And is in the process of commiting
    May 25, 2012 4:40:52 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:52 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000003_0 is allowed to commit now
    May 25, 2012 4:40:52 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_m_000003_0′ to /home/anil/judcon12/output
    May 25, 2012 4:40:54 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:54 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000003_0′ done.
    May 25, 2012 4:40:54 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@46238a47
    May 25, 2012 4:40:55 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000004_0 is done. And is in the process of commiting
    May 25, 2012 4:40:55 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:55 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000004_0 is allowed to commit now
    May 25, 2012 4:40:55 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_m_000004_0′ to /home/anil/judcon12/output
    May 25, 2012 4:40:57 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:57 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000004_0′ done.
    May 25, 2012 4:40:57 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@559113f8
    May 25, 2012 4:40:58 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000005_0 is done. And is in the process of commiting
    May 25, 2012 4:40:58 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:40:58 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000005_0 is allowed to commit now
    May 25, 2012 4:40:58 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_m_000005_0′ to /home/anil/judcon12/output
    May 25, 2012 4:41:00 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:41:00 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000005_0′ done.
    May 25, 2012 4:41:00 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@76a9b9c
    May 25, 2012 4:41:01 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000006_0 is done. And is in the process of commiting
    May 25, 2012 4:41:01 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:41:01 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000006_0 is allowed to commit now
    May 25, 2012 4:41:01 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_m_000006_0′ to /home/anil/judcon12/output
    May 25, 2012 4:41:03 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:41:03 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000006_0′ done.
    May 25, 2012 4:41:03 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@560c3014
    May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000007_0 is done. And is in the process of commiting
    May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000007_0 is allowed to commit now
    May 25, 2012 4:41:04 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_m_000007_0′ to /home/anil/judcon12/output
    May 25, 2012 4:41:06 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:41:06 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000007_0′ done.
    =======================

    So the map phase takes between 1 and 2 minutes. It is the reduce phase that does not finish for me. I will look into why that is the case. :)

    Let us see what the output of the map phase looks like:

    =============================
    anil@sadbhav:~/judcon12/output$ ls
    part-m-00000  part-m-00002  part-m-00004  part-m-00006  _SUCCESS
    part-m-00001  part-m-00003  part-m-00005  part-m-00007
    ==============================

    You can see the end result of the map phase in these files.
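
    To peek at what the mappers actually emitted, something like this will do (head just prints the first few lines of one part file):

    ==================
    head -n 5 output/part-m-00000
    ==================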

    Now on to figuring out why the reduce phase takes so long.

    Let me look at how big the input file is:

    ===================
     anil@sadbhav:~/judcon12/input$ wc -l cite75_99.txt
    16522439 cite75_99.txt
    ===================

    OMG! That is roughly 16.5 million entries. Too much for a laptop to handle.


    Let us try another experiment: choose 20000 patent citations from this file and save them to 20000cite.txt.
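
    One way to carve out that sample (any 20000 lines will do for this experiment; head simply keeps the first 20000, including the header line):

    ==================
    head -n 20000 cite75_99.txt > 20000cite.txt
    ==================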

    When I run the program now,  I see that the M/R execution took all of 10 secs to finish.


    Let us view the results of the M/R execution.

    =============================
    anil@sadbhav:~/judcon12/output$ ls
    part-r-00000  _SUCCESS
    ============================

    When I look inside part-r-00000, I see one long line of csv patent citations. Yeah, you are right, the job is busted (although the reducer itself is not really at fault). KeyValueTextInputFormat splits each line into key and value on a tab character by default, but cite75_99.txt is comma separated, so the entire line becomes the key and the value is empty. The mapper then emits every record under that single empty key, and the lone reduce group concatenates all of the lines into one giant CSV string. That is also why the reduce phase never finished on the full data set: one group with 16.5 million values, built up by repeated string concatenation. I need to fix it by splitting the comma-separated key in the mapper itself.
    That is the next step. This exercise was a failure, but there was a lesson here: if you mess up, you will wait. :)




    Ok,  here is the updated map reduce program that works:
    ======================
    package mapreduce;

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class PatentCitation extends Configured implements Tool{

        public static class PatentCitationMapper extends Mapper<Text,Text,Text,Text> {
            protected void map(Text key, Text value, Context context)
                    throws IOException, InterruptedException {
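                // KeyValueTextInputFormat found no tab, so the whole "citing,cited" line
                // arrives as the key; split on the comma and emit (cited, citing)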

                String[] citation = key.toString().split(",");
                context.write(new Text(citation[1]), new Text(citation[0]));
            }
        }

        public static class PatentCitationReducer extends Reducer<Text,Text,Text,Text>{
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
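                // concatenate every citing patent for this cited patent into one CSV string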
                String csv = "";
                for(Text val:values){
                    if(csv.length() > 0 ) csv += ",";
                    csv += val.toString();
                }
                context.write(key, new Text(csv));
            }
        }

        private  void deleteFilesInDirectory(File f) throws IOException {
            if (f.isDirectory()) {
                for (File c : f.listFiles())
                    deleteFilesInDirectory(c);
            }
            if (!f.delete())
                throw new FileNotFoundException("Failed to delete file: " + f);
        }

        @Override
        public int run(String[] args) throws Exception {
            if(args.length == 0)
                throw new IllegalArgumentException("Please provide input and output paths");

            Path inputPath = new Path(args[0]);
            File outputDir = new File(args[1]);
            deleteFilesInDirectory(outputDir);
            Path outputPath = new Path(args[1]);

            Job job = new Job(getConf(), "Hadoop Patent Citation Example");
            job.setJarByClass(PatentCitation.class);

            FileInputFormat.setInputPaths(job, inputPath);
            FileOutputFormat.setOutputPath(job, outputPath);

            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            job.setMapperClass(PatentCitationMapper.class);
            job.setReducerClass(PatentCitationReducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            //job.setNumReduceTasks(10000);

            return job.waitForCompletion(false) ? 0 : -1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new PatentCitation(), args));
        }
    }
    =======================================



    Running the updated program

    ================================
    May 25, 2012 6:14:26 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
    WARNING: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    May 25, 2012 6:14:26 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
    WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    May 25, 2012 6:14:26 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 1
    May 25, 2012 6:14:26 PM org.apache.hadoop.io.compress.snappy.LoadSnappy <clinit>
    WARNING: Snappy native library not loaded
    May 25, 2012 6:14:27 PM org.apache.hadoop.util.ProcessTree isSetsidSupported
    INFO: setsid exited with exit code 0
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@420f9c40
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
    INFO: io.sort.mb = 100
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
    INFO: data buffer = 79691776/99614720
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
    INFO: record buffer = 262144/327680
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
    INFO: Starting flush of map output
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
    INFO: Finished spill 0
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000000_0′ done.
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3a56860b
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
    INFO: Merging 1 sorted segments
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
    INFO: Down to the last merge-pass, with 1 segments left of total size: 359270 bytes
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_r_000000_0 is allowed to commit now
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_r_000000_0′ to /home/anil/judcon12/output
    May 25, 2012 6:14:33 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO: reduce > reduce
    May 25, 2012 6:14:33 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_r_000000_0′ done.
    =================================


    Let us look at the output:
    ===============
    anil@sadbhav:~/judcon12/output$ ls
    part-r-00000  _SUCCESS
    ================

    If you look inside the part-r-00000 file, you will see the results:

    ======================
     "CITED" "CITING"
    1000715 3861270
    1001069 3858600
    1001170 3861317
    1001597 3861811
    1004288 3861154
    1006393 3861066
    1006952 3860293
    …….
    1429311 3861187
    1429835 3860154
    1429968 3860060
    1430491 3859976
    1431444 3861601
    1431718 3859022
    1432243 3861774
    1432467 3862478
    1433649 3861223,3861222
    1433923 3861232
    1434088 3862293
    1435134 3861526
    1435144 3858398
    ….
    ======================


    Woot!



    Additional Information


    If you see the following error, it means you have disk space issues.

    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out


    How do I set my own tmp directory for Hadoop?
    -Dhadoop.tmp.dir=
    or
    -Dmapred.local.dir=    (for map/reduce)

    References

    http://blog.hampisoftware.com

Digest powered by RSS Digest

    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@560c3014
    May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000007_0 is done. And is in the process of commiting
    May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:41:04 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_m_000007_0 is allowed to commit now
    May 25, 2012 4:41:04 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_m_000007_0′ to /home/anil/judcon12/output
    May 25, 2012 4:41:06 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 4:41:06 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000007_0′ done.
    =======================

    So the map phase takes between one and two minutes. It is the reduce phase that never finishes for me. I will dig into why that is the case. :)

    Let us see what the output of the map phase looks like:

    =============================
    anil@sadbhav:~/judcon12/output$ ls
    part-m-00000  part-m-00002  part-m-00004  part-m-00006  _SUCCESS
    part-m-00001  part-m-00003  part-m-00005  part-m-00007
    ==============================

    You can see the end result of the map phase in these files.

    Now on to figuring out why the reduce phase takes so long.

    Let me look at the input file size first.

    ===================
     anil@sadbhav:~/judcon12/input$ wc -l cite75_99.txt
    16522439 cite75_99.txt
    ===================

    OMG! That is roughly 16.5 million citation entries, far too much for a laptop running in local mode to chew through quickly.

    Let us try another experiment: take 20,000 patent citations from this file and save them to 20000cite.txt (for example, with head -n 20000 cite75_99.txt > 20000cite.txt).

    When I run the program on this smaller file, the M/R execution takes all of 10 seconds to finish.

    Let us view the results of the M/R execution.

    =============================
    anil@sadbhav:~/judcon12/output$ ls
    part-r-00000  _SUCCESS
    ============================

    When I look inside part-r-00000, I see one long line of CSV patent citations. Yeah, you are right: my reducer output is busted. The likely culprit is the input format rather than the reducer itself: KeyValueTextInputFormat splits key and value on a tab by default, so for this comma-separated file the entire line lands in the key, the value is empty, and everything gets grouped under a single empty map output key. I need to fix it.
    That is the next step. This run was a failure, but there was a lesson here: if you mess up, you will wait. :)

    OK, here is the updated map reduce program that works:
    ======================
    package mapreduce;

    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class PatentCitation extends Configured implements Tool{

        public static class PatentCitationMapper extends Mapper<Text,Text,Text,Text> {
            protected void map(Text key, Text value, Context context)
                    throws IOException, InterruptedException {

                String[] citation = key.toString().split(",");
                context.write(new Text(citation[1]), new Text(citation[0]));
            }
        }

        public static class PatentCitationReducer extends Reducer<Text,Text,Text,Text>{
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                String csv = "";
                for(Text val:values){
                    if (csv.length() > 0) csv += ",";
                    csv += val.toString();
                }
                context.write(key, new Text(csv));
            }
        }

        private  void deleteFilesInDirectory(File f) throws IOException {
            if (f.isDirectory()) {
                for (File c : f.listFiles())
                    deleteFilesInDirectory(c);
            }
            if (!f.delete())
                throw new FileNotFoundException("Failed to delete file: " + f);
        }

        @Override
        public int run(String[] args) throws Exception {
            if(args.length == 0)
                throw new IllegalArgumentException("Please provide input and output paths");

            Path inputPath = new Path(args[0]);
            File outputDir = new File(args[1]);
            deleteFilesInDirectory(outputDir);
            Path outputPath = new Path(args[1]);

            Job job = new Job(getConf(), "Hadoop Patent Citation Example");
            job.setJarByClass(PatentCitation.class);

            FileInputFormat.setInputPaths(job, inputPath);
            FileOutputFormat.setOutputPath(job, outputPath);

            job.setInputFormatClass(KeyValueTextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            job.setMapperClass(PatentCitationMapper.class);
            job.setReducerClass(PatentCitationReducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            //job.setNumReduceTasks(10000);

            return job.waitForCompletion(false) ? 0 : -1;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new Configuration(), new PatentCitation(), args));
        }
    }
    =======================================

    Running updated program

    ================================
    May 25, 2012 6:14:26 PM org.apache.hadoop.util.NativeCodeLoader <clinit>
    WARNING: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    May 25, 2012 6:14:26 PM org.apache.hadoop.mapred.JobClient copyAndConfigureFiles
    WARNING: No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    May 25, 2012 6:14:26 PM org.apache.hadoop.mapreduce.lib.input.FileInputFormat listStatus
    INFO: Total input paths to process : 1
    May 25, 2012 6:14:26 PM org.apache.hadoop.io.compress.snappy.LoadSnappy <clinit>
    WARNING: Snappy native library not loaded
    May 25, 2012 6:14:27 PM org.apache.hadoop.util.ProcessTree isSetsidSupported
    INFO: setsid exited with exit code 0
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@420f9c40
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
    INFO: io.sort.mb = 100
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
    INFO: data buffer = 79691776/99614720
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer <init>
    INFO: record buffer = 262144/327680
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer flush
    INFO: Starting flush of map output
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.MapTask$MapOutputBuffer sortAndSpill
    INFO: Finished spill 0
    May 25, 2012 6:14:27 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_m_000000_0′ done.
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task initialize
    INFO:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@3a56860b
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
    INFO: Merging 1 sorted segments
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Merger$MergeQueue merge
    INFO: Down to the last merge-pass, with 1 segments left of total size: 359270 bytes
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task done
    INFO: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO:
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapred.Task commit
    INFO: Task attempt_local_0001_r_000000_0 is allowed to commit now
    May 25, 2012 6:14:30 PM org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter commitTask
    INFO: Saved output of task ‘attempt_local_0001_r_000000_0′ to /home/anil/judcon12/output
    May 25, 2012 6:14:33 PM org.apache.hadoop.mapred.LocalJobRunner$Job statusUpdate
    INFO: reduce > reduce
    May 25, 2012 6:14:33 PM org.apache.hadoop.mapred.Task sendDone
    INFO: Task ‘attempt_local_0001_r_000000_0′ done.
    =================================

    Let us look at the output:
    ===============
    anil@sadbhav:~/judcon12/output$ ls
    part-r-00000  _SUCCESS
    ================

    If you look inside the part-xxx file, you will see the results:

    ======================
     "CITED" "CITING"
    1000715 3861270
    1001069 3858600
    1001170 3861317
    1001597 3861811
    1004288 3861154
    1006393 3861066
    1006952 3860293
    …….
    1429311 3861187
    1429835 3860154
    1429968 3860060
    1430491 3859976
    1431444 3861601
    1431718 3859022
    1432243 3861774
    1432467 3862478
    1433649 3861223,3861222
    1433923 3861232
    1434088 3862293
    1435134 3861526
    1435144 3858398
    ….
    ======================

    Woot!
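
    The output above lists, for each cited patent, all the patents that cite it. If what you actually want is just the number of citations per patent rather than the full list, a small variation of the reducer gets you that count directly. This is my own sketch, not code from the book; it assumes you also import org.apache.hadoop.io.IntWritable.

    ======================
    public static class CitationCountReducer extends Reducer<Text, Text, Text, IntWritable> {
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Count the citing patents instead of concatenating them into a CSV string
            int count = 0;
            for (Text val : values) {
                count++;
            }
            context.write(key, new IntWritable(count));
        }
    }
    ======================

    In run() you would then swap in this reducer and call job.setOutputKeyClass(Text.class) and job.setOutputValueClass(IntWritable.class).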

    Additional Information

    If you see the following error, it means Hadoop could not find a writable local directory with enough free space for intermediate output (i.e., you have disk space issues):

    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/spill0.out

    How do I set my own tmp directory for Hadoop?
    -Dhadoop.tmp.dir=
    or
    -Dmapred.local.dir=    (for map/reduce)
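
    Since the program runs through ToolRunner, the -D options above can be passed on the command line after the class name and GenericOptionsParser will fold them into the job Configuration. If you prefer to set the directories in code, a minimal sketch inside run() would look like this (the paths are placeholders, not values from my setup):

    ===================
    // Inside run(), before constructing the Job; example paths only
    Configuration conf = getConf();
    conf.set("hadoop.tmp.dir", "/data/hadoop-tmp");      // scratch space for Hadoop
    conf.set("mapred.local.dir", "/data/mapred-local");  // local dirs for map/reduce spills
    ===================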

    References

    http://blog.hampisoftware.com

Digest powered by RSS Digest

Today’s Links

Posted in: Big Data Technologies- May 01, 2014 No Comments
  • Working With Apache Solr

    Background

    Apache Solr is a search server built on Apache Lucene. You can index data in Solr and then run queries against it.

    Installing Apache Solr

    Download Apache Solr from an Apache mirror. At the time of writing, the latest version was 4.7.0.

    Start Apache Solr

    You can start Solr with the default embedded Jetty instance by going to the example directory of your Solr installation.

    $> java -jar start.jar

    If there are no errors, your Solr instance is available at http://localhost:8983/solr/#/

    You should see the default Solr Welcome Screen.

    Exploring the collections

    In the left-hand column, use the drop-down to choose the default collection "Collection1".

    You should see a screen for the collection.

    Click on “query” on the left hand side.

    You should see the query screen.

    Press the “Execute query” button.

    You should see the query response in JSON format as follows:

    {  "responseHeader": {    "status": 0,    "QTime": 2,    "params": {      "indent": "true",      "q": "*:*",      "_": "1396205101116",      "wt": "json"    }  },  "response": {    "numFound": 0,    "start": 0,    "docs": []  }}

    The response shows that we have no data.

    This is because we have not fed Apache Solr any data to index.

    Index some data

    We have Apache Solr running. We will try to index some data.

    In another command window, let us go to the solr-install-dir/example/exampledocs directory.

    solr-4.7.0/example/exampledocs$ java -jar post.jar .
    SimplePostTool version 1.5
    Posting files to base url http://localhost:8983/solr/update using content-type application/xml..
    Indexing directory . (16 files, depth=0)
    POSTing file books.csv
    SimplePostTool: WARNING: Solr returned an error #400 Bad Request
    SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/update
    POSTing file books.json
    SimplePostTool: WARNING: Solr returned an error #400 Bad Request
    SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP response code: 400 for URL: http://localhost:8983/solr/update
    POSTing file gb18030-example.xml
    POSTing file hd.xml
    POSTing file ipod_other.xml
    POSTing file ipod_video.xml
    POSTing file manufacturers.xml
    POSTing file mem.xml
    POSTing file money.xml
    POSTing file monitor.xml
    POSTing file monitor2.xml
    POSTing file mp500.xml
    POSTing file sd500.xml
    POSTing file solr.xml
    POSTing file utf8-example.xml
    POSTing file vidcard.xml
    16 files indexed.
    COMMITting Solr index changes to http://localhost:8983/solr/update..
    Time spent: 0:00:00.413
    anil@anil:~/solr/solr-4.7.0/example/exampledocs$

    Basically, Apache Solr has now indexed the files in the example/exampledocs directory. (The books.csv and books.json POSTs failed with HTTP 400 because post.jar submitted every file as application/xml.)
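
    You can also index documents programmatically instead of using post.jar. Below is a minimal SolrJ sketch, assuming the SolrJ 4.x client jars are on the classpath and the default example core collection1; the document id and field values are made up for illustration.

    ======================
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class IndexOneDoc {
        public static void main(String[] args) throws Exception {
            // Point SolrJ at the default core started by "java -jar start.jar"
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "test-001");             // made-up id
            doc.addField("name", "A test document");    // made-up field value

            server.add(doc);     // send the document to Solr
            server.commit();     // make it visible to searches
            server.shutdown();
        }
    }
    ======================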

     Testing the Indexed Data

    Now that Solr is indexed with some data, we can send queries.

    In the Solr admin screen in the browser, just click the “Execute Query” button. This should return all the data that is available.

    {  "responseHeader": {    "status": 0,    "QTime": 5,    "params": {      "indent": "true",      "q": "*:*",      "_": "1396205796896",      "wt": "json"    }  },  "response": {    "numFound": 32,    "start": 0,    "docs": [      {        "id": "GB18030TEST",        "name": "Test with some GB18030 encoded characters",        "features": [          "No accents here",          "这是一个功能",          "This is a feature (translated)",          "这份文件是很有光泽",          "This document is very shiny (translated)"        ],        "price": 0,        "price_c": "0,USD",        "inStock": true,        "_version_": 1464027600530702300      },      {        "id": "SP2514N",        "name": "Samsung SpinPoint P120 SP2514N - hard drive - 250 GB - ATA-133",        "manu": "Samsung Electronics Co. Ltd.",        "manu_id_s": "samsung",        "cat": [          "electronics",          "hard drive"        ],        "features": [          "7200RPM, 8MB cache, IDE Ultra ATA-133",          "NoiseGuard, SilentSeek technology, Fluid Dynamic Bearing (FDB) motor"        ],        "price": 92,        "price_c": "92,USD",        "popularity": 6,        "inStock": true,        "manufacturedate_dt": "2006-02-13T15:26:37Z",        "store": "35.0752,-97.032",        "_version_": 1464027600570548200      },      {        "id": "6H500F0",        "name": "Maxtor DiamondMax 11 - hard drive - 500 GB - SATA-300",        "manu": "Maxtor Corp.",        "manu_id_s": "maxtor",        "cat": [          "electronics",          "hard drive"        ],        "features": [          "SATA 3.0Gb/s, NCQ",          "8.5ms seek",          "16MB cache"        ],        "price": 350,        "price_c": "350,USD",        "popularity": 6,        "inStock": true,        "store": "45.17614,-93.87341",        "manufacturedate_dt": "2006-02-13T15:26:37Z",        "_version_": 1464027600579985400      },      {        "id": "F8V7067-APL-KIT",        "name": "Belkin Mobile Power Cord for iPod w/ Dock",        "manu": "Belkin",        "manu_id_s": "belkin",        "cat": [          "electronics",          "connector"        ],        "features": [          "car power adapter, white"        ],        "weight": 4,        "price": 19.95,        "price_c": "19.95,USD",        "popularity": 1,        "inStock": false,        "store": "45.18014,-93.87741",        "manufacturedate_dt": "2005-08-01T16:30:25Z",        "_version_": 1464027600588374000      },      {        "id": "IW-02",        "name": "iPod & iPod Mini USB 2.0 Cable",        "manu": "Belkin",        "manu_id_s": "belkin",        "cat": [          "electronics",          "connector"        ],        "features": [          "car power adapter for iPod, white"        ],        "weight": 2,        "price": 11.5,        "price_c": "11.50,USD",        "popularity": 1,        "inStock": false,        "store": "37.7752,-122.4232",        "manufacturedate_dt": "2006-02-14T23:55:59Z",        "_version_": 1464027600592568300      },      {        "id": "MA147LL/A",        "name": "Apple 60 GB iPod with Video Playback Black",        "manu": "Apple Computer Inc.",        "manu_id_s": "apple",        "cat": [          "electronics",          "music"        ],        "features": [          "iTunes, Podcasts, Audiobooks",          "Stores up to 15,000 songs, 25,000 photos, or 150 hours of video",          "2.5-inch, 320x240 color TFT LCD display with LED backlight",          "Up to 20 hours of battery life",          "Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video",          "Notes, Calendar, Phone 
book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication"        ],        "includes": "earbud headphones, USB cable",        "weight": 5.5,        "price": 399,        "price_c": "399.00,USD",        "popularity": 10,        "inStock": true,        "store": "37.7752,-100.0232",        "manufacturedate_dt": "2005-10-12T08:00:00Z",        "_version_": 1464027600599908400      },      {        "id": "adata",        "compName_s": "A-Data Technology",        "address_s": "46221 Landing Parkway Fremont, CA 94538",        "_version_": 1464027600616685600      },      {        "id": "apple",        "compName_s": "Apple",        "address_s": "1 Infinite Way, Cupertino CA",        "_version_": 1464027600618782700      },      {        "id": "asus",        "compName_s": "ASUS Computer",        "address_s": "800 Corporate Way Fremont, CA 94539",        "_version_": 1464027600619831300      },      {        "id": "ati",        "compName_s": "ATI Technologies",        "address_s": "33 Commerce Valley Drive East Thornhill, ON L3T 7N6 Canada",        "_version_": 1464027600620880000      }    ]  }}

    Above we have just sent a query for all data.

    Let us try to be specific with our queries.

    In the edit box named "q", enter the word ipod and click the "Execute Query" button. You should see the following data returned as the JSON response.

    {  "responseHeader": {    "status": 0,    "QTime": 8,    "params": {      "indent": "true",      "q": "ipod",      "_": "1396206251386",      "wt": "json"    }  },  "response": {    "numFound": 3,    "start": 0,    "docs": [      {        "id": "IW-02",        "name": "iPod & iPod Mini USB 2.0 Cable",        "manu": "Belkin",        "manu_id_s": "belkin",        "cat": [          "electronics",          "connector"        ],        "features": [          "car power adapter for iPod, white"        ],        "weight": 2,        "price": 11.5,        "price_c": "11.50,USD",        "popularity": 1,        "inStock": false,        "store": "37.7752,-122.4232",        "manufacturedate_dt": "2006-02-14T23:55:59Z",        "_version_": 1464027600592568300      },      {        "id": "F8V7067-APL-KIT",        "name": "Belkin Mobile Power Cord for iPod w/ Dock",        "manu": "Belkin",        "manu_id_s": "belkin",        "cat": [          "electronics",          "connector"        ],        "features": [          "car power adapter, white"        ],        "weight": 4,        "price": 19.95,        "price_c": "19.95,USD",        "popularity": 1,        "inStock": false,        "store": "45.18014,-93.87741",        "manufacturedate_dt": "2005-08-01T16:30:25Z",        "_version_": 1464027600588374000      },      {        "id": "MA147LL/A",        "name": "Apple 60 GB iPod with Video Playback Black",        "manu": "Apple Computer Inc.",        "manu_id_s": "apple",        "cat": [          "electronics",          "music"        ],        "features": [          "iTunes, Podcasts, Audiobooks",          "Stores up to 15,000 songs, 25,000 photos, or 150 hours of video",          "2.5-inch, 320x240 color TFT LCD display with LED backlight",          "Up to 20 hours of battery life",          "Plays AAC, MP3, WAV, AIFF, Audible, Apple Lossless, H.264 video",          "Notes, Calendar, Phone book, Hold button, Date display, Photo wallet, Built-in games, JPEG photo playback, Upgradeable firmware, USB 2.0 compatibility, Playback speed control, Rechargeable capability, Battery level indication"        ],        "includes": "earbud headphones, USB cable",        "weight": 5.5,        "price": 399,        "price_c": "399.00,USD",        "popularity": 10,        "inStock": true,        "store": "37.7752,-100.0232",        "manufacturedate_dt": "2005-10-12T08:00:00Z",        "_version_": 1464027600599908400      }    ]  }}

    Basically, we now get back all the documents containing the word "ipod".

    Tips

    1. By default, a Solr search returns 10 results. If you want all the matching documents back, just add "&rows=100000" (or another suitably high value) to the query.
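
    2. The same queries can be issued from Java with SolrJ. A rough sketch, again assuming the SolrJ 4.x jars and the default collection1 core, with the rows value mirroring the tip above:

    ======================
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class QueryIpod {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");

            SolrQuery query = new SolrQuery("ipod");  // same query as typed into the "q" box
            query.setRows(100000);                    // return (up to) all matches, per the tip above

            QueryResponse response = server.query(query);
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id") + " : " + doc.getFieldValue("name"));
            }
            server.shutdown();
        }
    }
    ======================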

Digest powered by RSS Digest

Today’s Links

Posted in: Big Data Technologies- Jan 19, 2014 No Comments
  • Apache Hadoop Security
    Apache Hadoop is synonymous with Big Data. The majority of Big Data processing happens via the Hadoop ecosystem. If you have a Big Data project, the chances that you are using elements of the Hadoop ecosystem are very high.

    One of the biggest challenges with the Hadoop ecosystem is security. The Map/Reduce processing framework in Hadoop does not really have major security support. HDFS uses the Unix file system security model, which may work perfectly well for storing data in a distributed file system.

    Kerberos-based authentication is used as the primary security mechanism in the Hadoop Map/Reduce framework, but there are no real data confidentiality/privacy mechanisms supported.

    Data can be in motion (passing through programs or network elements) or at rest in data stores. The HDFS security controls seem adequate for data at rest, but for data that is being processed via the Map/Reduce mechanism, it is up to the developers/programmers to apply encryption themselves.

    If you need guidance, please do email me at anil  AT  apache  DOT org.  I will be happy to suggest approaches to achieve Big Data Security.


Digest powered by RSS Digest

Today’s Links

Posted in: Big Data Technologies- Sep 02, 2013 No Comments
  • GeoFencing : trend in big data
    GeoFencing, in my opinion, will be an exciting trend in the world of Big Data, particularly in retail and customer loyalty. Reading many online articles, I see that retailers are already putting it into place.

    A typical usage of geofencing: you enter a store and, based on the permission you granted earlier to be tracked, the store determines that you are in the vicinity. The store's app or an SMS then delivers the latest coupons or deals to increase your loyalty.

    There is important guidance on geofencing for Android developers at http://developer.android.com/training/location/geofencing.html

    Geofencing involves the use of GPS or some other tracking technology, along with specialized software. The important paradigms in play are customer loyalty and context-based discounts.

Digest powered by RSS Digest

Today’s Links

Posted in: Big Data Technologies- Jan 28, 2013 No Comments
  • Apache HBase Performance Considerations
    As you know, Apache HBase is a columnar database in the Hadoop ecosystem. Since a column family can store different types of data, it is very important to understand the various performance options you have, such as compression, at the column-family level.

    Please refer to http://jimbojw.com/wiki/index.php?title=Understanding_HBase_column-family_performance_options  for an excellent writeup on the various column family performance options.
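
    As a concrete illustration, here is a rough sketch of creating a table with a compressed column family from the Java client API of that era (HBase 0.9x; the Compression class moved packages in later releases). The table name, family name, and the choice of GZ are made-up examples, not recommendations:

    ======================
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.io.hfile.Compression;

    public class CreateCompressedTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor table = new HTableDescriptor("metrics");   // made-up table name
            HColumnDescriptor family = new HColumnDescriptor("raw");    // made-up column family
            family.setCompressionType(Compression.Algorithm.GZ);        // compress this family on disk
            table.addFamily(family);

            admin.createTable(table);
        }
    }
    ======================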

  • Apache HBase – a simple tutorial
    Apache HBase is a column-oriented database in the Hadoop ecosystem. You can take a look at Apache HBase at its website, http://hbase.apache.org/

    HBase Operations

    Step 1: Download HBase

    I downloaded hbase-0.94.4, which was the latest version at the time. You may get a later version.

    Step 2: Unzip HBase

    $> mkdir hbase
    $> gunzip hbase-0.94.4.tar.gz
    $> ls
    hbase-0.94.4.tar

    $> tar xvf hbase-0.94.4.tar

    Now you should have a directory called hbase-0.94.4

    $> cd hbase-0.94.4

    Step 3:  Start HBase Daemon

    $> cd bin
    $> ./hbase-daemon.sh start master
    starting master, logging to  …/hbase-0.94.4/bin/../logs/hbase-anil-master-2.local.out
    $

    Step 4:  Enter HBase Shell

    $> ./hbase shell

    HBase Shell; enter 'help<RETURN>' for list of supported commands.
    Type "exit<RETURN>" to leave the HBase Shell
    Version 0.94.4, r1428173, Thu Jan  3 06:29:56 UTC 2013

    hbase(main):001:0>

    Step 5:  Create an HBase Table 

    The table will be called blog, with a column family called "posts" and another column family called "images".

    hbase(main):007:0> create 'blog', 'posts', 'images'
    0 row(s) in 1.0610 seconds

    Step 6: Populate the HBase Table

    hbase(main):009:0> put 'blog','firstpost','posts:title','My HBase Post'
    0 row(s) in 0.0220 seconds

    hbase(main):010:0> put 'blog','firstpost','posts:author','Anil'
    0 row(s) in 0.0050 seconds

    hbase(main):011:0> put 'blog','firstpost','posts:location','Chicago'
    0 row(s) in 0.0070 seconds

    hbase(main):012:0> put 'blog','firstpost','posts:content','HBase is cool'
    0 row(s) in 0.0050 seconds

    hbase(main):014:0> put 'blog','firstpost','images:header', 'first.jpg'
    0 row(s) in 0.0060 seconds

    hbase(main):015:0> put 'blog','firstpost','images:bodyimage', 'second.jpg'
    0 row(s) in 0.0040 seconds

    INFO ON HBASE CELL INSERTION FORMAT
    NOTE:  Put a cell 'value' at specified table/row/column and optionally
    timestamp coordinates.  To put a cell value into table 't1' at
    row 'r1' under column 'c1' marked with the time 'ts1', do:
            hbase> put 't1', 'r1', 'c1', 'value', ts1

    Step 7:  Verify the HBase Table Contents

    hbase(main):016:0> get 'blog','firstpost'
    COLUMN                CELL                                                    
     images:bodyimage     timestamp=1359347351382, value=second.jpg              
     images:header        timestamp=1359347324836, value=first.jpg                
     posts:author         timestamp=1359347197336, value=Anil                    
     posts:content        timestamp=1359347230734, value=HBase is cool            
     posts:location       timestamp=1359347210258, value=Chicago                  
     posts:title          timestamp=1359347161523, value=My HBase Post            
    6 row(s) in 0.0350 seconds

    hbase(main):017:0>
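
    The same put/get flow can also be driven from Java. Here is a small sketch against the 0.94-era client API, assuming hbase-0.94.4.jar and its dependencies are on the classpath:

    ======================
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class BlogClient {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "blog");

            // Equivalent of: put 'blog','firstpost','posts:title','My HBase Post'
            Put put = new Put(Bytes.toBytes("firstpost"));
            put.add(Bytes.toBytes("posts"), Bytes.toBytes("title"), Bytes.toBytes("My HBase Post"));
            table.put(put);

            // Equivalent of: get 'blog','firstpost'
            Get get = new Get(Bytes.toBytes("firstpost"));
            Result result = table.get(get);
            String title = Bytes.toString(result.getValue(Bytes.toBytes("posts"), Bytes.toBytes("title")));
            System.out.println("posts:title = " + title);

            table.close();
        }
    }
    ======================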

    Cleaning Up

    To delete the HBase table you created above, you need to first disable it and then drop it:

    hbase(main):005:0> disable 'blog'
    0 row(s) in 2.0560 seconds

    hbase(main):006:0> drop 'blog'
    0 row(s) in 1.0560 seconds

    Troubleshooting

    If you make a mistake in the column family name (for example, image instead of images), you may see an error like this:

    ERROR: org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family image does not exist in region blog,,1359346963541.261ada3f5ada71f241759e6a062dc523. in table {NAME => ‘blog’, FAMILIES => [{NAME => 'images', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'true'}, {NAME => 'posts', DATA_BLOCK_ENCODING => 'NONE', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', MIN_VERSIONS => '0', KEEP_DELETED_CELLS => 'false', BLOCKSIZE => '65536', ENCODE_ON_DISK => 'true', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}

    HBase REST Server

    If you are interested in starting the HBase server as a REST server, do the following:

    Start the RegionServer

     ./hbase-daemon.sh start regionserver
    starting regionserver, logging to hbase-0.94.4/bin/../logs/hbase-anil-regionserver-2.local.out
    $

    Start HBase REST Server

    $ ./hbase-daemon.sh start rest -p 50000

    NOTE:  You can use any port. I use 50000 for the rest server.

    So when I go to http://localhost:50000, I see my HBase tables.

    When I go to http://localhost:50000/version, it gives me some version metadata info.
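
    You can also hit the REST server from plain Java. A minimal sketch using java.net.HttpURLConnection against the /version endpoint started above (the Accept header is just one of the formats the REST server can serve):

    ======================
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class HBaseRestVersion {
        public static void main(String[] args) throws Exception {
            // Ask the REST server (started above on port 50000) for its version metadata
            URL url = new URL("http://localhost:50000/version");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestProperty("Accept", "application/json");

            BufferedReader reader = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }
    ======================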

    Stop HBase REST Server

    $ ./hbase-daemon.sh stop rest -p 50000
    stopping rest..

    Stop HBase Master

    $ ./hbase-daemon.sh stop master
    stopping master.

Digest powered by RSS Digest

Today’s Links

Posted in: Big Data Technologies- Dec 28, 2012 No Comments
  • Trying out Apache Pig
    I was going through Alex P's blog posts, and one post that caught my attention was related to Apache Pig. I have been thinking of playing with Pig Latin scripts to simulate Map/Reduce functionality.

    I tried to use Pig to run Alex's script. He just gives the input values and the Pig output along with the Pig script; there is no information on how to use Pig. That is fine. He just wants the reader to go through the Pig manual. :)

    Here is what I tried out:

    1) Downloaded Apache Pig 0.9.2 (the latest version at the time).
    2) The script from Alex uses PiggyBank, which is in the Pig contrib directory, so it looks like I will have to build Pig.

    =====================
    pig_directory $>  ant

    [javacc] Java Compiler Compiler Version 4.2 (Parser Generator)
       [javacc] (type “javacc” with no arguments for help)

       [javacc] File “SimpleCharStream.java” is being rebuilt.
       [javacc] Parser generated successfully.

    prepare:
        [mkdir] Created dir: xxx/pig-0.9.2/src-gen/org/apache/pig/parser

    genLexer:

    genParser:

    genTreeParser:

    gen:

    compile:
         [echo] *** Building Main Sources ***
         [echo] *** To compile with all warnings enabled, supply -Dall.warnings=1 on command line ***
         [echo] *** If all.warnings property is supplied, compile-sources-all-warnings target will be executed ***
         [echo] *** Else, compile-sources (which only warns about deprecations) target will be executed ***
      [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

    compile-sources:
        [javac] xxx/pig-0.9.2/build.xml:429: warning: ‘includeantruntime’ was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
        [javac] Compiling 667 source files to /home/anil/hadoop/pig/pig-0.9.2/build/classes
        [javac] Note: Some input files use or override a deprecated API.
        [javac] Note: Recompile with -Xlint:deprecation for details.
        [javac] Note: Some input files use unchecked or unsafe operations.
        [javac] Note: Recompile with -Xlint:unchecked for details.
         [copy] Copying 1 file to xxx/pig-0.9.2/build/classes/org/apache/pig/tools/grunt
         [copy] Copying 1 file to xxx/pig-0.9.2/build/classes/org/apache/pig/tools/grunt
      [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

    compile-sources-all-warnings:

    jar:
      [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

    jarWithSvn:
      [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

    ivy-download:
          [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
          [get] To: /home/anil/pig/pig-0.9.2/ivy/ivy-2.2.0.jar
          [get] Not modified – so not downloaded

    ivy-init-dirs:

    ivy-probe-antlib:

    ivy-init-antlib:

    ivy-init:

    ivy-buildJar:
    [ivy:resolve] :: resolving dependencies :: org.apache.pig#Pig;0.9.3-SNAPSHOT
    [ivy:resolve]     confs: [buildJar]
    [ivy:resolve]     found com.sun.jersey#jersey-core;1.8 in maven2
    [ivy:resolve]     found org.apache.hadoop#hadoop-core;1.0.0 in maven2
    [ivy:resolve]     found commons-cli#commons-cli;1.2 in maven2
    [ivy:resolve]     found xmlenc#xmlenc;0.52 in maven2
    [ivy:resolve]     found commons-httpclient#commons-httpclient;3.0.1 in maven2
    [ivy:resolve]     found commons-codec#commons-codec;1.4 in maven2
    [ivy:resolve]     found org.apache.commons#commons-math;2.1 in maven2
    [ivy:resolve]     found commons-configuration#commons-configuration;1.6 in maven2
    [ivy:resolve]     found commons-collections#commons-collections;3.2.1 in maven2
    [ivy:resolve]     found commons-lang#commons-lang;2.4 in maven2
    [ivy:resolve]     found commons-logging#commons-logging;1.1.1 in maven2
    [ivy:resolve]     found commons-digester#commons-digester;1.8 in maven2
    [ivy:resolve]     found commons-beanutils#commons-beanutils;1.7.0 in maven2
    [ivy:resolve]     found commons-beanutils#commons-beanutils-core;1.8.0 in maven2
    [ivy:resolve]     found commons-net#commons-net;1.4.1 in maven2
    [ivy:resolve]     found oro#oro;2.0.8 in maven2
    [ivy:resolve]     found org.mortbay.jetty#jetty;6.1.26 in maven2
    [ivy:resolve]     found org.mortbay.jetty#jetty-util;6.1.26 in maven2
    [ivy:resolve]     found org.mortbay.jetty#servlet-api;2.5-20081211 in maven2
    [ivy:resolve]     found tomcat#jasper-runtime;5.5.12 in maven2
    [ivy:resolve]     found tomcat#jasper-compiler;5.5.12 in maven2
    [ivy:resolve]     found org.mortbay.jetty#jsp-api-2.1;6.1.14 in maven2
    [ivy:resolve]     found org.mortbay.jetty#servlet-api-2.5;6.1.14 in maven2
    [ivy:resolve]     found org.mortbay.jetty#jsp-2.1;6.1.14 in maven2
    [ivy:resolve]     found org.eclipse.jdt#core;3.1.1 in maven2
    [ivy:resolve]     found ant#ant;1.6.5 in maven2
    [ivy:resolve]     found commons-el#commons-el;1.0 in maven2
    [ivy:resolve]     found net.java.dev.jets3t#jets3t;0.7.1 in maven2
    [ivy:resolve]     found net.sf.kosmosfs#kfs;0.3 in maven2
    [ivy:resolve]     found hsqldb#hsqldb;1.8.0.10 in maven2
    [ivy:resolve]     found org.apache.hadoop#hadoop-test;1.0.0 in maven2
    [ivy:resolve]     found org.apache.ftpserver#ftplet-api;1.0.0 in maven2
    [ivy:resolve]     found org.apache.mina#mina-core;2.0.0-M5 in maven2
    [ivy:resolve]     found org.slf4j#slf4j-api;1.5.2 in maven2
    [ivy:resolve]     found org.apache.ftpserver#ftpserver-core;1.0.0 in maven2
    [ivy:resolve]     found org.apache.ftpserver#ftpserver-deprecated;1.0.0-M2 in maven2
    [ivy:resolve]     found log4j#log4j;1.2.16 in maven2
    [ivy:resolve]     found org.slf4j#slf4j-log4j12;1.6.1 in maven2
    [ivy:resolve]     found org.slf4j#slf4j-api;1.6.1 in maven2
    [ivy:resolve]     found org.apache.avro#avro;1.5.3 in maven2
    [ivy:resolve]     found com.googlecode.json-simple#json-simple;1.1 in maven2
    [ivy:resolve]     found com.jcraft#jsch;0.1.38 in maven2
    [ivy:resolve]     found jline#jline;0.9.94 in maven2
    [ivy:resolve]     found net.java.dev.javacc#javacc;4.2 in maven2
    [ivy:resolve]     found org.codehaus.jackson#jackson-mapper-asl;1.7.3 in maven2
    [ivy:resolve]     found org.codehaus.jackson#jackson-core-asl;1.7.3 in maven2
    [ivy:resolve]     found joda-time#joda-time;1.6 in maven2
    [ivy:resolve]     found com.google.guava#guava;11.0 in maven2
    [ivy:resolve]     found org.python#jython;2.5.0 in maven2
    [ivy:resolve]     found rhino#js;1.7R2 in maven2
    [ivy:resolve]     found org.antlr#antlr;3.4 in maven2
    [ivy:resolve]     found org.antlr#antlr-runtime;3.4 in maven2
    [ivy:resolve]     found org.antlr#stringtemplate;3.2.1 in maven2
    [ivy:resolve]     found antlr#antlr;2.7.7 in maven2
    [ivy:resolve]     found org.antlr#ST4;4.0.4 in maven2
    [ivy:resolve]     found org.apache.zookeeper#zookeeper;3.3.3 in maven2
    [ivy:resolve]     found org.jboss.netty#netty;3.2.2.Final in maven2
    [ivy:resolve]     found org.apache.hbase#hbase;0.90.0 in maven2
    [ivy:resolve]     found org.vafer#jdeb;0.8 in maven2
    [ivy:resolve]     found junit#junit;4.5 in maven2
    [ivy:resolve]     found org.apache.hive#hive-exec;0.8.0 in maven2
    [ivy:resolve] downloading http://repo2.maven.org/maven2/junit/junit/4.5/junit-4.5.jar …
    [ivy:resolve] ... (194kB)
    [ivy:resolve] … (0kB)
    [ivy:resolve]     [SUCCESSFUL ] junit#junit;4.5!junit.jar (822ms)
    [ivy:resolve] downloading http://repo2.maven.org/maven2/org/apache/hive/hive-exec/0.8.0/hive-exec-0.8.0.jar …
    [ivy:resolve] ... (3372kB)
    [ivy:resolve] .. (0kB)
    [ivy:resolve]     [SUCCESSFUL ] org.apache.hive#hive-exec;0.8.0!hive-exec.jar (2262ms)
    [ivy:resolve] :: resolution report :: resolve 9172ms :: artifacts dl 3114ms
    [ivy:resolve]     :: evicted modules:
    [ivy:resolve]     junit#junit;3.8.1 by [junit#junit;4.5] in [buildJar]
    [ivy:resolve]     commons-logging#commons-logging;1.0.3 by [commons-logging#commons-logging;1.1.1] in [buildJar]
    [ivy:resolve]     commons-codec#commons-codec;1.2 by [commons-codec#commons-codec;1.4] in [buildJar]
    [ivy:resolve]     commons-logging#commons-logging;1.1 by [commons-logging#commons-logging;1.1.1] in [buildJar]
    [ivy:resolve]     commons-codec#commons-codec;1.3 by [commons-codec#commons-codec;1.4] in [buildJar]
    [ivy:resolve]     commons-httpclient#commons-httpclient;3.1 by [commons-httpclient#commons-httpclient;3.0.1] in [buildJar]
    [ivy:resolve]     org.codehaus.jackson#jackson-mapper-asl;1.0.1 by [org.codehaus.jackson#jackson-mapper-asl;1.7.3] in [buildJar]
    [ivy:resolve]     org.slf4j#slf4j-api;1.5.2 by [org.slf4j#slf4j-api;1.6.1] in [buildJar]
    [ivy:resolve]     org.apache.mina#mina-core;2.0.0-M4 by [org.apache.mina#mina-core;2.0.0-M5] in [buildJar]
    [ivy:resolve]     org.apache.ftpserver#ftplet-api;1.0.0-M2 by [org.apache.ftpserver#ftplet-api;1.0.0] in [buildJar]
    [ivy:resolve]     org.apache.ftpserver#ftpserver-core;1.0.0-M2 by [org.apache.ftpserver#ftpserver-core;1.0.0] in [buildJar]
    [ivy:resolve]     org.apache.mina#mina-core;2.0.0-M2 by [org.apache.mina#mina-core;2.0.0-M5] in [buildJar]
    [ivy:resolve]     commons-cli#commons-cli;1.0 by [commons-cli#commons-cli;1.2] in [buildJar]
    [ivy:resolve]     org.antlr#antlr-runtime;3.3 by [org.antlr#antlr-runtime;3.4] in [buildJar]
        ———————————————————————
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ———————————————————————
        |     buildJar     |   74  |   2   |   2   |   14  ||   61  |   2   |
        ———————————————————————
    [ivy:retrieve] :: retrieving :: org.apache.pig#Pig
    [ivy:retrieve]     confs: [buildJar]
    [ivy:retrieve]     3 artifacts copied, 58 already retrieved (3855kB/20ms)

    buildJar:
         [echo] svnString exported
          [jar] Building jar: /home/anil/hadoop/pig/pig-0.9.2/build/pig-0.9.3-SNAPSHOT-core.jar
          [jar] Building jar: /home/anil/hadoop/pig/pig-0.9.2/build/pig-0.9.3-SNAPSHOT.jar
      [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

    include-meta:
         [copy] Copying 1 file to /home/anil/hadoop/pig/pig-0.9.2
      [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

    jarWithOutSvn:

    jar-withouthadoop:
      [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

    jar-withouthadoopWithSvn:
      [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

    ivy-download:
          [get] Getting: http://repo2.maven.org/maven2/org/apache/ivy/ivy/2.2.0/ivy-2.2.0.jar
          [get] To: /home/anil/hadoop/pig/pig-0.9.2/ivy/ivy-2.2.0.jar
          [get] Not modified – so not downloaded

    ivy-init-dirs:

    ivy-probe-antlib:

    ivy-init-antlib:

    ivy-init:

    ivy-buildJar:
    [ivy:resolve] :: resolving dependencies :: org.apache.pig#Pig;0.9.3-SNAPSHOT
    [ivy:resolve]     confs: [buildJar]
    [ivy:resolve]     found com.sun.jersey#jersey-core;1.8 in maven2
    [ivy:resolve]     found org.apache.hadoop#hadoop-core;1.0.0 in maven2
    [ivy:resolve]     found commons-cli#commons-cli;1.2 in maven2
    [ivy:resolve]     found xmlenc#xmlenc;0.52 in maven2
    [ivy:resolve]     found commons-httpclient#commons-httpclient;3.0.1 in maven2
    [ivy:resolve]     found commons-codec#commons-codec;1.4 in maven2
    [ivy:resolve]     found org.apache.commons#commons-math;2.1 in maven2
    [ivy:resolve]     found commons-configuration#commons-configuration;1.6 in maven2
    [ivy:resolve]     found commons-collections#commons-collections;3.2.1 in maven2
    [ivy:resolve]     found commons-lang#commons-lang;2.4 in maven2
    [ivy:resolve]     found commons-logging#commons-logging;1.1.1 in maven2
    [ivy:resolve]     found commons-digester#commons-digester;1.8 in maven2
    [ivy:resolve]     found commons-beanutils#commons-beanutils;1.7.0 in maven2
    [ivy:resolve]     found commons-beanutils#commons-beanutils-core;1.8.0 in maven2
    [ivy:resolve]     found commons-net#commons-net;1.4.1 in maven2
    [ivy:resolve]     found oro#oro;2.0.8 in maven2
    [ivy:resolve]     found org.mortbay.jetty#jetty;6.1.26 in maven2
    [ivy:resolve]     found org.mortbay.jetty#jetty-util;6.1.26 in maven2
    [ivy:resolve]     found org.mortbay.jetty#servlet-api;2.5-20081211 in maven2
    [ivy:resolve]     found tomcat#jasper-runtime;5.5.12 in maven2
    [ivy:resolve]     found tomcat#jasper-compiler;5.5.12 in maven2
    [ivy:resolve]     found org.mortbay.jetty#jsp-api-2.1;6.1.14 in maven2
    [ivy:resolve]     found org.mortbay.jetty#servlet-api-2.5;6.1.14 in maven2
    [ivy:resolve]     found org.mortbay.jetty#jsp-2.1;6.1.14 in maven2
    [ivy:resolve]     found org.eclipse.jdt#core;3.1.1 in maven2
    [ivy:resolve]     found ant#ant;1.6.5 in maven2
    [ivy:resolve]     found commons-el#commons-el;1.0 in maven2
    [ivy:resolve]     found net.java.dev.jets3t#jets3t;0.7.1 in maven2
    [ivy:resolve]     found net.sf.kosmosfs#kfs;0.3 in maven2
    [ivy:resolve]     found hsqldb#hsqldb;1.8.0.10 in maven2
    [ivy:resolve]     found org.apache.hadoop#hadoop-test;1.0.0 in maven2
    [ivy:resolve]     found org.apache.ftpserver#ftplet-api;1.0.0 in maven2
    [ivy:resolve]     found org.apache.mina#mina-core;2.0.0-M5 in maven2
    [ivy:resolve]     found org.slf4j#slf4j-api;1.5.2 in maven2
    [ivy:resolve]     found org.apache.ftpserver#ftpserver-core;1.0.0 in maven2
    [ivy:resolve]     found org.apache.ftpserver#ftpserver-deprecated;1.0.0-M2 in maven2
    [ivy:resolve]     found log4j#log4j;1.2.16 in maven2
    [ivy:resolve]     found org.slf4j#slf4j-log4j12;1.6.1 in maven2
    [ivy:resolve]     found org.slf4j#slf4j-api;1.6.1 in maven2
    [ivy:resolve]     found org.apache.avro#avro;1.5.3 in maven2
    [ivy:resolve]     found com.googlecode.json-simple#json-simple;1.1 in maven2
    [ivy:resolve]     found com.jcraft#jsch;0.1.38 in maven2
    [ivy:resolve]     found jline#jline;0.9.94 in maven2
    [ivy:resolve]     found net.java.dev.javacc#javacc;4.2 in maven2
    [ivy:resolve]     found org.codehaus.jackson#jackson-mapper-asl;1.7.3 in maven2
    [ivy:resolve]     found org.codehaus.jackson#jackson-core-asl;1.7.3 in maven2
    [ivy:resolve]     found joda-time#joda-time;1.6 in maven2
    [ivy:resolve]     found com.google.guava#guava;11.0 in maven2
    [ivy:resolve]     found org.python#jython;2.5.0 in maven2
    [ivy:resolve]     found rhino#js;1.7R2 in maven2
    [ivy:resolve]     found org.antlr#antlr;3.4 in maven2
    [ivy:resolve]     found org.antlr#antlr-runtime;3.4 in maven2
    [ivy:resolve]     found org.antlr#stringtemplate;3.2.1 in maven2
    [ivy:resolve]     found antlr#antlr;2.7.7 in maven2
    [ivy:resolve]     found org.antlr#ST4;4.0.4 in maven2
    [ivy:resolve]     found org.apache.zookeeper#zookeeper;3.3.3 in maven2
    [ivy:resolve]     found org.jboss.netty#netty;3.2.2.Final in maven2
    [ivy:resolve]     found org.apache.hbase#hbase;0.90.0 in maven2
    [ivy:resolve]     found org.vafer#jdeb;0.8 in maven2
    [ivy:resolve]     found junit#junit;4.5 in maven2
    [ivy:resolve]     found org.apache.hive#hive-exec;0.8.0 in maven2
    [ivy:resolve] :: resolution report :: resolve 168ms :: artifacts dl 15ms
    [ivy:resolve]     :: evicted modules:
    [ivy:resolve]     junit#junit;3.8.1 by [junit#junit;4.5] in [buildJar]
    [ivy:resolve]     commons-logging#commons-logging;1.0.3 by [commons-logging#commons-logging;1.1.1] in [buildJar]
    [ivy:resolve]     commons-codec#commons-codec;1.2 by [commons-codec#commons-codec;1.4] in [buildJar]
    [ivy:resolve]     commons-logging#commons-logging;1.1 by [commons-logging#commons-logging;1.1.1] in [buildJar]
    [ivy:resolve]     commons-codec#commons-codec;1.3 by [commons-codec#commons-codec;1.4] in [buildJar]
    [ivy:resolve]     commons-httpclient#commons-httpclient;3.1 by [commons-httpclient#commons-httpclient;3.0.1] in [buildJar]
    [ivy:resolve]     org.codehaus.jackson#jackson-mapper-asl;1.0.1 by [org.codehaus.jackson#jackson-mapper-asl;1.7.3] in [buildJar]
    [ivy:resolve]     org.slf4j#slf4j-api;1.5.2 by [org.slf4j#slf4j-api;1.6.1] in [buildJar]
    [ivy:resolve]     org.apache.mina#mina-core;2.0.0-M4 by [org.apache.mina#mina-core;2.0.0-M5] in [buildJar]
    [ivy:resolve]     org.apache.ftpserver#ftplet-api;1.0.0-M2 by [org.apache.ftpserver#ftplet-api;1.0.0] in [buildJar]
    [ivy:resolve]     org.apache.ftpserver#ftpserver-core;1.0.0-M2 by [org.apache.ftpserver#ftpserver-core;1.0.0] in [buildJar]
    [ivy:resolve]     org.apache.mina#mina-core;2.0.0-M2 by [org.apache.mina#mina-core;2.0.0-M5] in [buildJar]
    [ivy:resolve]     commons-cli#commons-cli;1.0 by [commons-cli#commons-cli;1.2] in [buildJar]
    [ivy:resolve]     org.antlr#antlr-runtime;3.3 by [org.antlr#antlr-runtime;3.4] in [buildJar]
        ---------------------------------------------------------------------
        |                  |            modules            ||   artifacts   |
        |       conf       | number| search|dwnlded|evicted|| number|dwnlded|
        ---------------------------------------------------------------------
        |     buildJar     |   74  |   0   |   0   |   14  ||   61  |   0   |
        ---------------------------------------------------------------------
    [ivy:retrieve] :: retrieving :: org.apache.pig#Pig
    [ivy:retrieve]     confs: [buildJar]
    [ivy:retrieve]     0 artifacts copied, 61 already retrieved (0kB/9ms)

    buildJar-withouthadoop:
         [echo] svnString exported
          [jar] Building jar: /home/anil/hadoop/pig/pig-0.9.2/build/pig-0.9.3-SNAPSHOT-withouthadoop.jar
         [copy] Copying 1 file to /home/anil/hadoop/pig/pig-0.9.2
      [taskdef] Could not load definitions from resource net/sf/antcontrib/antcontrib.properties. It could not be found.

    jar-withouthadoopWithOutSvn:

    jar-all:

    BUILD SUCCESSFUL
    Total time: 5 minutes 38 seconds
    ==========================

    Looks like Pig was built successfully.  This step was needed in order to build piggybank.

    Now go to the directory where piggybank resides.

    =====================
    anil@sadbhav:~/hadoop/pig/pig-0.9.2/contrib/piggybank/java$ ant
    Buildfile: /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/build.xml

    init:

    compile:
         [echo]  *** Compiling Pig UDFs ***
        [javac] /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/build.xml:92: warning: ‘includeantruntime’ was not set, defaulting to build.sysclasspath=last; set to false for repeatable builds
        [javac] Compiling 153 source files to /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/build/classes
        [javac] Note: Some input files use or override a deprecated API.
        [javac] Note: Recompile with -Xlint:deprecation for details.

    jar:
         [echo]  *** Creating pigudf.jar ***
          [jar] Building jar: /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/piggybank.jar

    BUILD SUCCESSFUL
    Total time: 3 seconds
    ======================================

    3)  Now I have a directory to test my pig scripts.
    Let us call it “anilpig”.

    I create the following Pig script (distance.pig), which is a direct copy of what Alex has:

    ======================================
    REGISTER /home/anil/hadoop/pig/pig-0.9.2/contrib/piggybank/java/piggybank.jar;

    define radians org.apache.pig.piggybank.evaluation.math.toRadians();
    define sin org.apache.pig.piggybank.evaluation.math.SIN();
    define cos org.apache.pig.piggybank.evaluation.math.COS();
    define sqrt org.apache.pig.piggybank.evaluation.math.SQRT();
    define atan2 org.apache.pig.piggybank.evaluation.math.ATAN2();

    geo = load 'haversine.csv' using PigStorage(';') as (id1: long, lat1: double, lon1: double);
    geo2 = load 'haversine.csv' using PigStorage(';') as (id2: long, lat2: double, lon2: double);

    geoCross = CROSS geo, geo2;

    geoDist = FOREACH geoCross GENERATE id1, id2, 6371 * 2 * atan2(sqrt(sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)), sqrt(1 - (sin(radians(lat2 - lat1) / 2) * sin(radians(lat2 - lat1) / 2) + cos(radians(lat1)) * cos(radians(lat2)) * sin(radians(lon2 - lon1) / 2) * sin(radians(lon2 - lon1) / 2)))) as dist;

    dump geoDist;
    ======================================

    Please do not forget to update the path to piggybank.jar.

    I also create the following haversine.csv file:
    ===============
    1;48.8583;2.2945
    2;48.8738;2.295
    ================
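
    As a quick sanity check (this is just a standalone sketch, separate from the Pig job; the class name HaversineCheck is mine), the same haversine formula can be evaluated in plain Java for the two rows above. It should print roughly 1.724 km, the value we expect Pig to produce for the pair (1,2).

    ================
    public class HaversineCheck {

        // Haversine great-circle distance in kilometers, mirroring the Pig expression
        static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
            double halfDLat = Math.toRadians(lat2 - lat1) / 2;
            double halfDLon = Math.toRadians(lon2 - lon1) / 2;
            double a = Math.sin(halfDLat) * Math.sin(halfDLat)
                     + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                     * Math.sin(halfDLon) * Math.sin(halfDLon);
            return 6371 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
        }

        public static void main(String[] args) {
            // Row 1: 48.8583;2.2945   Row 2: 48.8738;2.295
            System.out.println(haversineKm(48.8583, 2.2945, 48.8738, 2.295));
            // Prints approximately 1.7239
        }
    }
    ================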

    4)  Let us run pig to see if the values match what Alex quotes in his blog post.

    ==================
    ~/hadoop/pig/anilpig$ ../pig-0.9.2/bin/pig -x local distance.pig

    which: no hadoop in (/usr/local/bin:/usr/bin:/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/bin:/usr/sbin:/usr/java/jdk1.6.0_30/bin:/opt/apache-maven-3.0.2/bin:/home/anil/.local/bin:/home/anil/bin:/usr/bin:/usr/sbin:/usr/java/jdk1.6.0_30/bin:/opt/apache-maven-3.0.2/bin)
    2012-02-19 12:05:13,316 [main] INFO  org.apache.pig.Main – Logging error messages to: /home/anil/hadoop/pig/anilpig/pig_1329674713314.log
    2012-02-19 12:05:13,409 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine – Connecting to hadoop file system at: file:///
    2012-02-19 12:05:13,911 [main] WARN  org.apache.pig.PigServer – Encountered Warning IMPLICIT_CAST_TO_DOUBLE 10 time(s).
    2012-02-19 12:05:13,916 [main] INFO  org.apache.pig.tools.pigstats.ScriptState – Pig features used in the script: CROSS
    2012-02-19 12:05:14,051 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler – File concatenation threshold: 100 optimistic? false
    2012-02-19 12:05:14,083 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler$LastInputStreamingOptimizer – Rewrite: POPackage->POForEach to POJoinPackage
    2012-02-19 12:05:14,090 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer – MR plan size before optimization: 1
    2012-02-19 12:05:14,090 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer – MR plan size after optimization: 1
    2012-02-19 12:05:14,108 [main] INFO  org.apache.pig.tools.pigstats.ScriptState – Pig script settings are added to the job
    2012-02-19 12:05:14,114 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
    2012-02-19 12:05:14,133 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Setting up single store job
    2012-02-19 12:05:14,148 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – BytesPerReducer=1000000000 maxReducers=999 totalInputFileSize=66
    2012-02-19 12:05:14,148 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler – Neither PARALLEL nor default parallelism is set for this job. Setting number of reducers to 1
    2012-02-19 12:05:14,224 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 1 map-reduce job(s) waiting for submission.
    2012-02-19 12:05:14,234 [Thread-2] WARN  org.apache.hadoop.util.NativeCodeLoader – Unable to load native-hadoop library for your platform… using builtin-java classes where applicable
    2012-02-19 12:05:14,239 [Thread-2] WARN  org.apache.hadoop.mapred.JobClient – No job jar file set.  User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
    2012-02-19 12:05:14,315 [Thread-2] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat – Total input paths to process : 1
    2012-02-19 12:05:14,315 [Thread-2] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths to process : 1
    2012-02-19 12:05:14,323 [Thread-2] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths (combined) to process : 1
    2012-02-19 12:05:14,329 [Thread-2] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat – Total input paths to process : 1
    2012-02-19 12:05:14,329 [Thread-2] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths to process : 1
    2012-02-19 12:05:14,329 [Thread-2] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths (combined) to process : 1
    2012-02-19 12:05:14,562 [Thread-3] INFO  org.apache.hadoop.util.ProcessTree – setsid exited with exit code 0
    2012-02-19 12:05:14,564 [Thread-3] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@313816e0
    2012-02-19 12:05:14,578 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – io.sort.mb = 100
    2012-02-19 12:05:14,600 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – data buffer = 79691776/99614720
    2012-02-19 12:05:14,600 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – record buffer = 262144/327680
    2012-02-19 12:05:14,636 [Thread-3] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader – Created input record counter: Input records from _0_haversine.csv
    2012-02-19 12:05:14,638 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – Starting flush of map output
    2012-02-19 12:05:14,643 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – Finished spill 0
    2012-02-19 12:05:14,645 [Thread-3] INFO  org.apache.hadoop.mapred.Task – Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
    2012-02-19 12:05:14,725 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – HadoopJobId: job_local_0001
    2012-02-19 12:05:14,725 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 0% complete
    2012-02-19 12:05:17,545 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner –
    2012-02-19 12:05:17,546 [Thread-3] INFO  org.apache.hadoop.mapred.Task – Task ‘attempt_local_0001_m_000000_0′ done.
    2012-02-19 12:05:17,549 [Thread-3] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@36d83365
    2012-02-19 12:05:17,551 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – io.sort.mb = 100
    2012-02-19 12:05:17,572 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – data buffer = 79691776/99614720
    2012-02-19 12:05:17,572 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – record buffer = 262144/327680
    2012-02-19 12:05:17,591 [Thread-3] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader – Created input record counter: Input records from _1_haversine.csv
    2012-02-19 12:05:17,592 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – Starting flush of map output
    2012-02-19 12:05:17,593 [Thread-3] INFO  org.apache.hadoop.mapred.MapTask – Finished spill 0
    2012-02-19 12:05:17,594 [Thread-3] INFO  org.apache.hadoop.mapred.Task – Task:attempt_local_0001_m_000001_0 is done. And is in the process of commiting
    2012-02-19 12:05:20,547 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner –
    2012-02-19 12:05:20,548 [Thread-3] INFO  org.apache.hadoop.mapred.Task – Task ‘attempt_local_0001_m_000001_0′ done.
    2012-02-19 12:05:20,560 [Thread-3] INFO  org.apache.hadoop.mapred.Task -  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2f2e43f1
    2012-02-19 12:05:20,560 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner –
    2012-02-19 12:05:20,564 [Thread-3] INFO  org.apache.hadoop.mapred.Merger – Merging 2 sorted segments
    2012-02-19 12:05:20,568 [Thread-3] INFO  org.apache.hadoop.mapred.Merger – Down to the last merge-pass, with 2 segments left of total size: 160 bytes
    2012-02-19 12:05:20,568 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner –
    2012-02-19 12:05:20,623 [Thread-3] INFO  org.apache.hadoop.mapred.Task – Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
    2012-02-19 12:05:20,623 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner –
    2012-02-19 12:05:20,624 [Thread-3] INFO  org.apache.hadoop.mapred.Task – Task attempt_local_0001_r_000000_0 is allowed to commit now
    2012-02-19 12:05:20,625 [Thread-3] INFO  org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter – Saved output of task ‘attempt_local_0001_r_000000_0′ to file:/tmp/temp371866094/tmp-1622554263
    2012-02-19 12:05:23,558 [Thread-3] INFO  org.apache.hadoop.mapred.LocalJobRunner – reduce > reduce
    2012-02-19 12:05:23,558 [Thread-3] INFO  org.apache.hadoop.mapred.Task – Task ‘attempt_local_0001_r_000000_0′ done.
    2012-02-19 12:05:24,730 [main] WARN  org.apache.pig.tools.pigstats.PigStatsUtil – Failed to get RunningJob for job job_local_0001
    2012-02-19 12:05:24,732 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – 100% complete
    2012-02-19 12:05:24,732 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats – Detected Local mode. Stats reported below may be incomplete
    2012-02-19 12:05:24,734 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats – Script Statistics:

    HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
    1.0.0    0.9.3-SNAPSHOT    anil    2012-02-19 12:05:14    2012-02-19 12:05:24    CROSS

    Success!

    Job Stats (time in seconds):
    JobId    Alias    Feature    Outputs
    job_local_0001    geo,geo2,geoCross,geoDist        file:/tmp/temp371866094/tmp-1622554263,

    Input(s):
    Successfully read records from: “file:///home/anil/hadoop/pig/anilpig/haversine.csv”
    Successfully read records from: “file:///home/anil/hadoop/pig/anilpig/haversine.csv”

    Output(s):
    Successfully stored records in: “file:/tmp/temp371866094/tmp-1622554263″

    Job DAG:
    job_local_0001

    2012-02-19 12:05:24,736 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher – Success!
    2012-02-19 12:05:24,739 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat – Total input paths to process : 1
    2012-02-19 12:05:24,739 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths to process : 1
    (1,1,0.0)
    (1,2,1.7239093620868347)
    (2,1,1.7239093620868347)
    (2,2,0.0)
    ===================

    Pig has kicked off a map reduce job in the background.

    How much time did this script take?
    Let us look at the first log entry and the last one.
    ————————-
    2012-02-19 12:05:13,316 [main] INFO  org.apache.pig.Main – Logging error

     2012-02-19 12:05:24,739 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil – Total input paths to process : 1
    ————————-
    About 11 secs (12:05:24.739 minus 12:05:13.316 is roughly 11.4 seconds).

    The run does show some stats:
    —————
    HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
    1.0.0    0.9.3-SNAPSHOT    anil    2012-02-19 12:05:14    2012-02-19 12:05:24    CROSS
    ———————
    About 10 secs.

    As you can see, the values (1.724) match what Alex quotes.  So I have been successful in testing the Haversine script from AlexP.  The next step is to play with the script further and try out Pig’s extended functionality.

    Additional Details:
    CROSS computes the cross product of two or more relations; for relations with m and n tuples the result has m x n tuples, which is why our 2-row input yields the 4 output rows above.

    References
    http://fierydata.com/2012/05/11/hadoop-fundamentals-an-introduction-to-pig-2/

    PLEASE DO NOT FORGET TO SEE MY POST.

  • Hadoop with Drools, Infinispan, PicketLink etc
    Here are the slides that I used at JUDCON 2012 in Boston.
    http://www.jboss.org/dms/judcon/2012boston/presentations/judcon2012boston_day1track3session2.pdf

    In your Map/Reduce programs, you should be able to use any Java library of your choice.  For example (see the sketch after this list), you could use:

    • Infinispan Data Grids to send cache events from your Map Reduce programs.
    • Drools rules engine to apply rules in your M/R programs.
    • PicketLink to bring security aspects into your M/R programs.
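
    For what it is worth, here is a minimal sketch of that pattern. The LibraryAwareMapper class and the RecordProcessor interface below are hypothetical placeholders (not Drools, Infinispan, or PicketLink APIs); the point is only that any Java library on the job classpath can be bootstrapped once in setup() and then invoked from map().

    =====================
    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class LibraryAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Hypothetical stand-in for any third-party Java component
        // (a rules session, a data grid client, a security check, etc.)
        interface RecordProcessor {
            String process(String record);
        }

        private RecordProcessor processor;

        @Override
        protected void setup(Context context) {
            // Bootstrap the library once per task; this trivial stand-in
            // just upper-cases the record.
            processor = new RecordProcessor() {
                public String process(String record) {
                    return record.toUpperCase();
                }
            };
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Delegate per-record work to the library, then emit the result.
            context.write(new Text(key.toString()),
                    new Text(processor.process(value.toString())));
        }
    }
    =====================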

Digest powered by RSS Digest

Today’s Links

Posted in: Big Data Technologies- Dec 04, 2012 No Comments
  • Impressions on Cloudera Impala
    Today I attended a 6pm meetup titled “Cloudera Impala”, arranged by the Chicago Big Data meetup group. We were fortunate to have Marcel Kornacker, Lead Architect of Cloudera Impala, as the presenter.  Marcel must have been pleasantly surprised to experience 70 degree Fahrenheit weather in Chicago in December. It was one of those beautiful days, courtesy of “Global Warming”.

    Speaker Impressions:-

    My first impressions of Marcel were these: he was unlike those speakers who do pre-talk theatrics such as going around the room shaking hands or speaking loudly.  He moved quietly (closer to the presentation area) and kept to quiet conversations. So I deduced him to be a geeky dude who does not seek conversations in the presentation room, and probably not a marketing or technical-evangelist type who would be shallow on the technical details of the presentation.

    The other concern was whether he had a European accent that might be difficult to grasp if he was too geeky.  Then Marcel started speaking.  As they say, do not judge a book by its cover: he drove away the accent worry and gave me the feeling that I would at least be able to follow what he was going to say. He speaks well, and convincingly so, on a topic where he is the subject matter expert.

    Jonathan Seidman, organizer of the Chicago Big Data group, introduced Marcel as an ex-Googler who had worked on the F1 database project. I did not know what F1 was at Google, but it sounded important.  That was a good introduction to set the stage for Marcel: if he was employed at Google in a core database field, he should definitely know things well. As a presenter, Marcel did a good job discussing the objectives, intricacies, target areas, and limitations of Impala. Kudos!

    Impala Impressions :-

    Let me get back to Impala. Marcel said that the code was written in C++. Bummer. As you know, the Hadoop ecosystem is primarily Java (even though there are bits and pieces and tools that are non-Java, such as Hadoop Streaming). I guess Marcel knows C++ well; that is why he chose to write Impala in C++.  He mentioned that Impala’s interface for applications will be via ODBC. Ok, there is the first roadblock. I write Java code. If I want to be excited about Impala, I will need to look at some form of JDBC-to-ODBC bridge or wait for Marcel’s team to code up some client utilities.  People tinkering with the Hadoop ecosystem may have the same questions and impressions as me.

    While Hive exists for Java programmers to do SQL with the Hadoop ecosystem, Marcel is bringing C++ into the equation.  Here is the catch, though: according to Marcel, Impala performs three times better than Hive in certain situations. Wow, this could be a big thing.  But alas, we cannot use Impala via Java interfaces. So we are stuck with Hive if we want SQL-like interfaces into Hadoop (just remember, hives are a bad allergy and not fun :) ; we are talking about Apache Hive here).

    I am sure there will be takers for Impala. I am not going to do any experimentation with it because I do not intend to a) use C++ or ODBC, or b) use CDH4. My experiments are with the Apache Hadoop community version, and there are enough goodies to get excited about there. :)

    Unlike Hive, Impala does not use Map Reduce underneath. It uses query plans that get fragmented and distributed among the nodes in a cluster, and a component that gathers the results of the plan execution.

    On my way back after the talk, I googled Marcel to learn more about him.  I hit upon the following article, which gives very good background on Marcel.
    http://www.wired.com/wiredenterprise/2012/10/kornacker-cloudera-google/
    Basically, Marcel is a details guy with a PhD from the University of California, Berkeley, and he is an excellent cook.

    Cloudera Impala is in the hands of an excellent Chef.  Good Luck Marcel!

    Other people, such as http://java.sys-con.com/node/2461455, are getting excited about Impala as well.  There is mention of “near real time” without the use of Wind River or an RTOS. :)

Digest powered by RSS Digest