In this post, we will show how R can be used with Hadoop to extract useful information from massive amounts of data. It is assumed that you know R programming at least at the beginner level.

Basic Hadoop Commands

  • hadoop fs / hdfs dfs – lists all the commands offered by the Hadoop file system shell.

fs refers to a generic file system and can point to any file system, such as the local file system, WebHDFS, or S3, while dfs is specific to HDFS.

  • hadoop fs -ls – lists all files in the Hadoop file system.

  • hadoop fs -copyToLocal / hadoop fs -copyFromLocal source_path destination_path – copies a file from/to the Hadoop file system to/from the local file system.

    • Copying a file file.csv from the Hadoop file system to the Desktop (in the local file system): hadoop fs -copyToLocal ./file.csv /home/hduser/Desktop/.

    • Copying a file file.csv from the local file system to the default Hadoop file system directory: hadoop fs -copyFromLocal /home/hduser/file.csv

  • start-dfs.sh – runs the Hadoop distributed file system. This establishes one namenode and the related datanodes.

  • start-yarn.sh – starts the master and node resource managers, which schedule MapReduce jobs.

  • stop-dfs.sh – stops the Hadoop distributed file system.

  • stop-yarn.sh – stops the master and node resource managers.


rstudio & – runs R through RStudio, opening the RStudio GUI.

Setting system variables and loading libraries

Run the below lines in RStudio.
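
The original listing is not reproduced here; a typical setup looks like the sketch below, in which the Hadoop installation path and the streaming-jar version are assumptions that must be adjusted to match your own system.

```r
# Tell R where Hadoop lives. Both paths below are assumptions --
# replace them with the locations from your own installation.
Sys.setenv(HADOOP_CMD = "/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING =
  "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.3.jar")

# Load the RHadoop packages and initialise the HDFS connection.
library(rhdfs)
library(rmr2)
hdfs.init()
```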


The RHadoop collection of packages (rhdfs, rmr2, and plyrmr) can be used to connect Hadoop with R.


rhdfs

This package provides commands for file manipulation in HDFS in terms of reading, writing and moving files.
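
A minimal sketch of the rhdfs workflow (assuming HADOOP_CMD is set and Hadoop is running; the file names are illustrative):

```r
library(rhdfs)
hdfs.init()                        # connect R to HDFS

hdfs.ls("/")                       # list files, like `hadoop fs -ls /`
hdfs.put("file.csv", "/file.csv")  # copy a local file into HDFS

f <- hdfs.file("/file.csv", "r")   # open an HDFS file for reading
raw <- hdfs.read(f)                # read its raw bytes
hdfs.close(f)
```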


rmr2

This package allows the Hadoop MapReduce facility to be used inside the R environment. Its main functions are:

big.data.object – The big-data object

dfs.empty – Backend-independent file manipulation

equijoin – Equijoins using map-reduce

from.dfs, to.dfs – Read or write R objects from or to the file system

hadoop.settings – Important Hadoop settings in relation to rmr2

keyval – Create, project or concatenate key-value pairs

make.input.format – Create combinations of settings for flexible IO

mapreduce – MapReduce using Hadoop Streaming

rmr.options – Function to set and get package options

rmr.sample – Sample large data sets

rmr.str – Print a variable’s content

scatter – Function to split a file over several parts or to merge multiple parts into one

status – Set the status and define and increment counters for a Hadoop job

to.map, to.reduce – Create map-and-reduce functions from other functions
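
As a sketch of how these functions fit together, here is a small job that squares a vector of numbers. It assumes a working rmr2 installation; setting the backend to "local" (an option rmr2 provides for testing) lets it run without a live cluster.

```r
library(rmr2)
rmr.options(backend = "local")  # run without a Hadoop cluster, for testing

# Write some integers into the (local-backend) DFS as key-value pairs.
small.ints <- to.dfs(keyval(NULL, 1:10))

# Square each value in the map phase; no reduce step is needed.
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)
)

out <- from.dfs(result)
# out$key holds the original values and out$val their squares
```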


plyrmr

This package is used for basic data-manipulation needs.

  • data manipulation
    • bind.cols (add new columns),

    • where (select rows),

    • select (select columns),

    • rbind, transmute (all of the above plus summaries)

  • summaries
    • transmute,

    • sample,

    • count.cols,

    • quantile.cols,

    • top.k,

    • bottom.k

  • set operations
    • union,

    • intersect,

    • unique,

    • merge

transmute appears twice because it is a generalization over transform and summarize that allows us to increase or decrease the number of columns or rows, covering the need for multi-row summaries, flattening of data structures, etc.
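
A brief sketch of these verbs in action, assuming plyrmr is installed (its verbs also accept plain data frames, which makes local experimentation easy; the HDFS path in the comment is an assumption):

```r
library(plyrmr)

# Select rows and columns from a plain data frame.
small <- select(where(mtcars, cyl > 4), mpg, cyl)

# The same pipeline over an HDFS data set would replace mtcars with
# input("/user/hduser/mtcars") -- the path here is only illustrative.
```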

Useful Resources