In this post, we will look at how R, used together with Hadoop, can help extract useful information from massive amounts of data. It is assumed that you know R programming at least at the beginner level.
Basic Hadoop Commands
hdfs dfs – provides information about all the commands offered by the Hadoop file system.
fs refers to a generic file system, which can point to any file system such as the local file system, WebHDFS, S3, etc., while
dfs is specific to the Hadoop distributed file system (HDFS).
hadoop fs -ls – lists all files in the Hadoop file system.
hadoop fs -copyToLocal source_path destination_path – copies a file from the Hadoop file system to the local file system (its counterpart, -copyFromLocal, copies in the opposite direction).
Copying a file file.csv from the Hadoop file system to the Desktop (in the local file system):
hadoop fs -copyToLocal ./file.csv /home/hduser/Desktop/.
Copying a file file.csv from the local file system to the default Hadoop file system directory:
hadoop fs -copyFromLocal /home/hduser/file.csv
start-dfs.sh – starts the Hadoop distributed file system. This establishes one namenode and the related datanodes.
start-yarn.sh – starts the master and node resource managers used by MapReduce.
stop-yarn.sh – stops the master and node resource managers.
stop-dfs.sh – stops the Hadoop distributed file system.
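Putting these lifecycle commands together, a typical session might look like the sketch below (it assumes Hadoop's sbin directory is on your PATH; jps lists the running Java daemons so you can verify the NameNode, DataNode and ResourceManager are up):

```sh
# Start HDFS first (namenode + datanodes), then YARN (resource managers)
start-dfs.sh
start-yarn.sh

# Verify that the Hadoop daemons are running
jps

# ... run your jobs ...

# Stop in the reverse order
stop-yarn.sh
stop-dfs.sh
```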
rstudio & – runs R through RStudio, opening the RStudio GUI (the & launches it in the background so the terminal remains usable).
Setting system variables and loading libraries
Run the below lines in RStudio.
Sys.setenv(HADOOP_OPTS="-Djava.library.path=/usr/local/hadoop/lib/native")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")
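Once the variables are set, the RHadoop packages can be loaded and HDFS initialized. A minimal sketch, assuming rhdfs and rmr2 are already installed and the paths above match your installation:

```r
# Load the HDFS interface and initialize it (reads HADOOP_CMD)
library(rhdfs)
hdfs.init()

# Load the MapReduce interface (reads HADOOP_CMD and HADOOP_STREAMING)
library(rmr2)

# Quick sanity check: list the contents of the HDFS root directory
hdfs.ls("/")
```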
There are RHadoop libraries that can be used to connect Hadoop with R, notably rhdfs and rmr2.
rhdfs – This package provides commands for file manipulation (reading, writing and moving files) on the Hadoop file system.
rmr2 – This package allows the Hadoop MapReduce facility to be used inside the R environment.
big.data.object – the big-data object
dfs.empty – Backend-independent file manipulation
equijoin – Equijoins using map-reduce
from.dfs – Read or write R objects from or to the file system
hadoop.settings – Important Hadoop settings in relation to rmr2
keyval – Create, project or concatenate key-value pairs
make.input.format – Create combinations of settings for flexible IO
mapreduce – MapReduce using Hadoop Streaming
rmr.options – Function to set and get package options
rmr.sample – Sample large data sets
rmr.str – Print a variable’s content
scatter – Function to split a file over several parts or to merge multiple parts into one
status – Set the status and define and increment counters for a Hadoop job
to.map – Create map-and-reduce functions from other functions
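A minimal sketch showing several of these functions working together, squaring the integers 1 to 10 with a streaming MapReduce job (it assumes a working rmr2 setup; for experimenting without a cluster, rmr.options(backend = "local") runs jobs entirely in R):

```r
library(rmr2)

# Write a small R object into the DFS backend
small.ints <- to.dfs(1:10)

# Run a MapReduce job: the map function emits (v, v^2) key-value pairs
result <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)
)

# Read the key-value pairs back into the R session
out <- from.dfs(result)
str(out)
```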
plyrmr – This package is used for basic data-manipulation needs.
- data manipulation:
bind.cols (add new columns),
transmute (a generalization of transform and summarize)
- set operations
transmute is a generalization over transform and summarize that allows us to increase or decrease the number of columns or rows, covering the need for multi-row summaries, flattening of data structures, etc.
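These verbs also work directly on ordinary data frames, which makes them easy to try out. A small sketch using the built-in mtcars data set (the carb.per.cyl column name is just an illustration, and the commented HDFS path is a hypothetical example):

```r
library(plyrmr)

# bind.cols keeps all existing columns and adds new computed ones
bind.cols(mtcars, carb.per.cyl = carb / cyl)

# transmute keeps only the expressions you name, and can also summarize,
# so it can shrink a data set down to a single summary row
transmute(mtcars, mean.mpg = mean(mpg))

# The same operations can run on HDFS data by wrapping a path with input():
# transmute(input("/user/hduser/mtcars"), mean.mpg = mean(mpg))
```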