In this post, we show how R can be used with Hadoop to extract useful information from massive amounts of data. We assume that you know R programming at least at a beginner level.

Basic Hadoop Commands

  • hadoop fs / hdfs dfs – run without arguments, lists all the commands offered by the Hadoop file system shell.

fs refers to a generic file system and can point to any supported file system, such as the local file system, WebHDFS or S3, while dfs is specific to the Hadoop distributed file system (HDFS).

  • hadoop fs -ls – lists the files in a directory of the Hadoop file system (the user's HDFS home directory when no path is given).

  • hadoop fs -copyToLocal source_path destination_path – copies a file from the Hadoop file system to the local file system. Its counterpart, hadoop fs -copyFromLocal source_path destination_path, copies a file from the local file system to the Hadoop file system.

    • Copying file.csv from the Hadoop file system to the Desktop (in the local file system): hadoop fs -copyToLocal ./file.csv /home/hduser/Desktop/.

    • Copying file.csv from the Desktop (in the local file system) to the default Hadoop file system directory: hadoop fs -copyFromLocal /home/hduser/Desktop/file.csv

  • start-dfs.sh – starts the Hadoop distributed file system. This brings up one NameNode and the related DataNodes.

  • start-yarn.sh – starts YARN: the ResourceManager on the master and the NodeManagers on the worker nodes, on which MapReduce jobs run.

  • stop-yarn.sh – stops the YARN ResourceManager and NodeManagers, and with them MapReduce.

  • stop-dfs.sh – stops the Hadoop distributed file system.

RHadoop

rstudio & – opens the RStudio GUI in the background, so that R can be run through RStudio.

Setting system variables and loading libraries

Run the lines below in RStudio. Adjust the paths and the hadoop-streaming version to match your own installation.

Sys.setenv(HADOOP_OPTS="-Djava.library.path=/usr/local/hadoop/lib/native")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")

RHadoop is a collection of R packages that connect R with Hadoop. The main ones are described below.
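The section title also mentions loading libraries. The following is a minimal sketch of that step, assuming that the rhdfs and rmr2 packages described below are already installed and that the environment variables above have been set in the same R session.

```r
# Assumption: Sys.setenv(...) calls above were run first in this session,
# and the rhdfs and rmr2 packages are installed.
library(rhdfs)  # HDFS file operations from R; locates Hadoop through HADOOP_CMD
hdfs.init()     # initialise the connection to HDFS before any hdfs.* call

library(rmr2)   # MapReduce from R; uses HADOOP_CMD and HADOOP_STREAMING
```

The order matters in one respect: hdfs.init() has to be called before any other rhdfs function is used.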

rhdfs

This package provides functions for manipulating files in HDFS from R: reading, writing, copying and moving them. Some of its functions are listed below.

hdfs.copy

hdfs.move

hdfs.rename

hdfs.delete

hdfs.rm

hdfs.del

hdfs.chown

hdfs.put

hdfs.get
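A few of these functions can be combined into a round trip between the local file system and HDFS. This is only a sketch: it needs a running HDFS, the directory /user/hduser/demo is a hypothetical location chosen for illustration, and hdfs.mkdir and hdfs.ls are rhdfs functions that are not in the list above.

```r
library(rhdfs)
hdfs.init()

local_file <- "/home/hduser/Desktop/file.csv"  # local file from the earlier examples
hdfs_dir   <- "/user/hduser/demo"              # hypothetical HDFS directory
local_copy <- "/tmp/file_copy.csv"             # hypothetical local destination

hdfs.mkdir(hdfs_dir)                   # create the target directory in HDFS
hdfs.put(local_file, hdfs_dir)         # local file system -> HDFS
print(hdfs.ls(hdfs_dir))               # list the directory to confirm the upload

hdfs_file <- file.path(hdfs_dir, "file.csv")
hdfs.get(hdfs_file, local_copy)        # HDFS -> local file system
hdfs.delete(hdfs_file)                 # remove the file from HDFS again
```

hdfs.put and hdfs.get play the same role as hadoop fs -copyFromLocal and hadoop fs -copyToLocal on the command line.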

rmr2

This package allows the Hadoop MapReduce facility to be used inside the R environment.

big.data.object – the big-data object

dfs.empty – Backend-independent file manipulation

equijoin – Equijoins using map-reduce

from.dfs – Read or write R objects from or to the file system

hadoop.settings – Important Hadoop settings in relation to rmr2

keyval – Create, project or concatenate key-value pairs

make.input.format – Create combinations of settings for flexible IO

mapreduce – MapReduce using Hadoop Streaming

rmr.options – Function to set and get package options

rmr.sample – Sample large data sets

rmr.str – Print a variable’s content

scatter – Function to split a file over several parts or to merge multiple parts into one

status – Set the status and define and increment counters for a Hadoop job

to.map – Create map-and-reduce functions from other functions
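To see how several of these functions fit together, here is a small sketch of a complete job. It switches rmr2 to its local backend with rmr.options, so it should run without a cluster; the input data and the grouping by parity are made up for illustration. The function to.dfs, which writes an R object to the file system, is documented together with from.dfs.

```r
library(rmr2)

# Assumption: use the local backend so the job runs inside R without a cluster.
# Set backend = "hadoop" to submit the same job through Hadoop Streaming.
rmr.options(backend = "local")

# Write a small R vector to the (backend-specific) file system.
ints <- to.dfs(1:10)

# Map: emit each number with its parity (0 = even, 1 = odd) as the key.
# Reduce: sum all the values that share a key.
result <- mapreduce(
  input  = ints,
  map    = function(k, v) keyval(v %% 2, v),
  reduce = function(k, vv) keyval(k, sum(vv))
)

# Read the key-value pairs back into the R session.
out <- from.dfs(result)
totals <- data.frame(parity = keys(out), total = values(out))
print(totals)
```

The odd numbers from 1 to 10 add up to 25 and the even ones to 30, so the printed table should contain those two totals, in whatever order the backend returns the keys.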

plyrmr

This package is used for basic data-manipulation needs.

  • data manipulation
    • bind.cols (add new columns),

    • where (select rows),

    • select (select columns),

    • rbind, transmute (all of the above plus summaries)

  • summaries
    • transmute,

    • sample,

    • count.cols,

    • quantile.cols,

    • top.k,

    • bottom.k

  • set operations
    • union,

    • intersect,

    • unique,

    • merge

transmute appears twice because it is a generalization over transform and summarize that allows us to increase or decrease the number of columns or rows, covering the need for multi-row summaries, flattening of data structures, etc.
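The data manipulation functions can be tried on an ordinary data frame before they are pointed at data in Hadoop. The sketch below assumes that plyrmr is installed and that, as in the plyrmr tutorial, these functions accept a plain data frame; it uses the built-in mtcars data set, and the column name carb.per.cyl is chosen only for illustration.

```r
library(plyrmr)

# Add a new column computed from existing ones.
with_ratio <- bind.cols(mtcars, carb.per.cyl = carb / cyl)

# Select rows that satisfy a condition.
big <- where(mtcars, cyl > 4)

# Select columns.
narrow <- select(mtcars, mpg, cyl)

# A summary: here transmute reduces the rows to a single total.
total <- transmute(mtcars, sum(carb))

print(head(with_ratio))
print(total)
```

Here transmute collapses the data to one summary row, while bind.cols keeps every row and adds a column, which is the distinction drawn in the paragraph above.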

Useful Resources