In this post, we will look at how R combined with Hadoop can be used to extract useful information from massive amounts of data. It is assumed that you know R programming at least at a beginner level.
Basic Hadoop Commands
hadoop fs / hdfs dfs – provides information about all the commands offered by the Hadoop file system.
fs refers to a generic file system which can point to any file system such as the local file system, WebHDFS or S3, while dfs is specific to the Hadoop distributed file system (HDFS).
hadoop fs -ls – lists all files in the Hadoop file system.
hadoop fs -copyToLocal source_path destination_path – copies a file from the Hadoop file system to the local file system (hadoop fs -copyFromLocal copies in the opposite direction).
Copying a file file.csv from the Hadoop file system to the Desktop (in the local file system): hadoop fs -copyToLocal ./file.csv /home/hduser/Desktop/
Copying a file file.csv from the local file system to the default Hadoop file system directory: hadoop fs -copyFromLocal /home/hduser/file.csv
start-dfs.sh – starts the Hadoop distributed file system. This brings up the namenode and the related datanodes.
start-yarn.sh – starts the master and node resource managers and MapReduce.
stop-yarn.sh – stops the master and node resource managers and MapReduce.
stop-dfs.sh – stops the Hadoop distributed file system.
RHadoop
rstudio & – launches the RStudio GUI, through which R can be run.
Setting system variables and loading libraries
Run the lines below in RStudio, adjusting the paths and the Hadoop streaming jar version to match your installation.
Sys.setenv(HADOOP_OPTS="-Djava.library.path=/usr/local/hadoop/lib/native")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")
There are RHadoop libraries that can be used to connect Hadoop with R.
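As a minimal sketch (assuming the rhdfs, rmr2 and plyrmr packages are already installed, for example from the RevolutionAnalytics RHadoop releases), the packages can be loaded and the HDFS connection initialised as follows:
library(rhdfs)    # HDFS file manipulation from R
library(rmr2)     # MapReduce from R
library(plyrmr)   # data manipulation on top of rmr2
hdfs.init()       # initialise the connection to HDFS (required before any rhdfs call)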
rhdfs
This package provides functions for file manipulation in HDFS: reading, writing and moving files. The main functions are listed below, followed by a short usage sketch.
hdfs.copy – copy a file from one HDFS location to another
hdfs.move – move a file within HDFS
hdfs.rename – rename a file in HDFS
hdfs.delete – delete a file or directory in HDFS
hdfs.rm – delete a file or directory in HDFS
hdfs.del – delete a file or directory in HDFS
hdfs.chown – change the owner of a file
hdfs.put – copy a file from the local file system to HDFS
hdfs.get – copy a file from HDFS to the local file system
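A minimal usage sketch of rhdfs; the file names and HDFS paths below are purely illustrative examples:
library(rhdfs)
hdfs.init()                                                  # connect to HDFS
hdfs.put("/home/hduser/file.csv", "/user/hduser/")           # local file system -> HDFS
hdfs.get("/user/hduser/file.csv", "/home/hduser/Desktop/")   # HDFS -> local file system
hdfs.delete("/user/hduser/file.csv")                         # remove the file from HDFS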
rmr2
This package allows the Hadoop MapReduce facility to be used inside the R environment. Its main functions are listed below, followed by a small example.
big.data.object – the big-data object
dfs.empty – Backend-independent file manipulation
equijoin – Equijoins using map-reduce
from.dfs – Read or write R objects from or to the file system
hadoop.settings – Important Hadoop settings in relation to rmr2
keyval – Create, project or concatenate key-value pairs
make.input.format – Create combinations of settings for flexible IO
mapreduce – MapReduce using Hadoop Streaming
rmr.options – Function to set and get package options
rmr.sample – Sample large data sets
rmr.str – Print a variable’s content
scatter – Function to split a file over several parts or to merge multiple parts into one
status – Set the status and define and increment counters for a Hadoop job
to.map – Create map-and-reduce functions from other functions
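A minimal rmr2 sketch (assuming the environment variables above are set): write a small vector to HDFS with to.dfs, square each element in the map phase, and read the resulting key-value pairs back into R with from.dfs.
library(rmr2)
small.ints <- to.dfs(1:10)                  # store a small R object on HDFS
squares <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)     # emit (value, value squared) pairs
)
from.dfs(squares)                           # collect the key-value pairs back into R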
plyrmr
This package is used for basic data-manipulation needs.
- data manipulation: bind.cols (add new columns), where (select rows), select (select columns), rbind, transmute (all of the above plus summaries)
- summaries: transmute, sample, count.cols, quantile.cols, top.k, bottom.k
- set operations: union, intersect, unique, merge
transmute appears twice because it is a generalization over transform and summarize that allows us to increase or decrease the number of columns or rows, covering the need for multi-row summaries, flattening of data structures, etc.
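As a minimal sketch (the mtcars data set and the /tmp/mtcars path are purely illustrative), the same plyrmr verbs work on an ordinary data frame and, via input(), on data stored in HDFS:
library(plyrmr)
library(rmr2)                                         # for to.dfs

bind.cols(mtcars, carb.per.cyl = carb / cyl)          # add a derived column to a data frame
where(mtcars, cyl > 4)                                # keep only the rows with more than 4 cylinders

to.dfs(mtcars, output = "/tmp/mtcars")                # copy the data frame to HDFS (illustrative path)
as.data.frame(where(input("/tmp/mtcars"), cyl > 4))   # the same selection on the HDFS-backed data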