R and Hadoop

In this post, we look at how R can be used with Hadoop to extract useful information from massive amounts of data. We assume that you know R programming at least at a beginner level.

Basic Hadoop Commands

hadoop fs / hdfs dfs – run without further arguments, prints the list of commands offered by the Hadoop file system shell.

fs refers to a generic file system and can point to any supported file system, such as local, WebHDFS or S3, while dfs is specific to the Hadoop distributed file system (HDFS).

hadoop fs -ls – lists the files in a directory of the Hadoop file system (the user's HDFS home directory when no path is given).
hadoop fs -copyToLocal source_path destination_path – copies a file from the Hadoop file system to the local file system; hadoop fs -copyFromLocal copies in the opposite direction.
- Copying file.csv from the Hadoop file system to the Desktop (in the local file system): hadoop fs -copyToLocal ./file.csv /home/hduser/Desktop/
- Copying file.csv from the local file system to the default Hadoop file system directory: hadoop fs -copyFromLocal /home/hduser/file.csv
start-dfs.sh – starts the Hadoop distributed file system. This brings up the namenode and the related datanodes.
start-yarn.sh – starts YARN, that is, the resource manager on the master and the node managers that run MapReduce jobs.
stop-yarn.sh – stops the resource manager and the node managers, and with them MapReduce.
stop-dfs.sh – stops the Hadoop distributed file system.

RHadoop

rstudio & – opens the RStudio GUI as a background process of the shell, so that R can be run through RStudio.

Setting system variables and loading libraries

Run the lines below in RStudio, adjusting the paths to match your Hadoop and Java installation.

Sys.setenv(HADOOP_OPTS="-Djava.library.path=/usr/local/hadoop/lib/native")
Sys.setenv(HADOOP_HOME="/usr/local/hadoop")
Sys.setenv(HADOOP_CMD="/usr/local/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar")
Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64")

RHadoop is a collection of R packages that connect R with Hadoop. The three described below are rhdfs, rmr2 and plyrmr.
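Before loading the packages, it can help to confirm that the environment variables set above point at paths that actually exist. The following is a minimal sketch, assuming the packages are already installed; the names path_vars, paths and bad_vars are only illustrative.

```r
# Variables set with Sys.setenv() above that should point at existing paths.
# HADOOP_OPTS is left out because it holds a JVM option, not a path.
path_vars <- c("HADOOP_HOME", "HADOOP_CMD", "HADOOP_STREAMING", "JAVA_HOME")
paths <- Sys.getenv(path_vars)              # unset variables come back as ""
bad_vars <- path_vars[!file.exists(paths)]  # names whose path is missing

if (length(bad_vars) > 0) {
  warning("These variables do not point at an existing path: ",
          paste(bad_vars, collapse = ", "))
} else {
  library(rhdfs)   # HDFS file operations
  hdfs.init()      # initialise the connection to HDFS before other hdfs.* calls
  library(rmr2)    # MapReduce from R
  library(plyrmr)  # data manipulation on top of rmr2
}
```

If a warning appears, fix the corresponding Sys.setenv() line and run the check again before loading the libraries.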

rhdfs

This package provides commands for file manipulation in HDFS: reading, writing and moving files.

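hdfs.copy – copies a file from one HDFS location to another

hdfs.move – moves a file from one HDFS location to another

hdfs.rename – renames a file in HDFS

hdfs.delete – deletes a file or directory from HDFS

hdfs.rm – same as hdfs.delete

hdfs.del – same as hdfs.delete

hdfs.chown – changes the owner of a file in HDFS

hdfs.put – copies a file from the local file system to HDFS

hdfs.get – copies a file from HDFS to the local file system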
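To see how some of these functions fit together, here is a minimal sketch of a round trip between the local file system and HDFS. The function name hdfs_round_trip and its two arguments are hypothetical. hdfs.mkdir and hdfs.ls are rhdfs functions that are not in the list above. The sketch assumes library(rhdfs) and hdfs.init() have already been run.

```r
# Sketch only: call this after library(rhdfs) and hdfs.init().
# `hdfs_round_trip`, `local_file` and `hdfs_dir` are hypothetical names.
hdfs_round_trip <- function(local_file, hdfs_dir) {
  name      <- basename(local_file)
  hdfs_file <- file.path(hdfs_dir, name)
  hdfs_bak  <- file.path(hdfs_dir, paste0("backup_", name))
  local_bak <- file.path(dirname(local_file), paste0("backup_", name))

  hdfs.mkdir(hdfs_dir)              # create the target directory in HDFS
  hdfs.put(local_file, hdfs_file)   # local file system -> HDFS
  hdfs.copy(hdfs_file, hdfs_bak)    # HDFS -> HDFS
  print(hdfs.ls(hdfs_dir))          # show what is now in the directory
  hdfs.get(hdfs_bak, local_bak)     # HDFS -> local file system
  hdfs.delete(hdfs_bak)             # remove the backup copy from HDFS
  invisible(hdfs_file)              # return the HDFS path of the uploaded file
}

# Example call (the HDFS directory is a made-up path):
# hdfs_round_trip("/home/hduser/file.csv", "/user/hduser/demo")
```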

rmr2

This package allows the Hadoop MapReduce facility to be used inside the R environment.

big.data.object – the big-data object

dfs.empty – Backend-independent file manipulation

equijoin – Equijoins using map-reduce

from.dfs, to.dfs – Read or write R objects from or to the file system

hadoop.settings – Important Hadoop settings in relation to rmr2

keyval – Create, project or concatenate key-value pairs

make.input.format – Create combinations of settings for flexible IO

mapreduce – MapReduce using Hadoop Streaming

rmr.options – Function to set and get package options

rmr.sample – Sample large data sets

rmr.str – Print a variable’s content

scatter – Function to split a file over several parts or to merge multiple parts into one

status – Set the status and define and increment counters for a Hadoop job

to.map – Create map-and-reduce functions from other functions
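A small job shows how to.dfs, mapreduce, keyval and from.dfs work together. This is a sketch rather than a tuned implementation: the function name sum_by_parity is hypothetical, and keys and values are rmr2 accessors that are not in the list above. The map step labels each number with its remainder modulo 2, and the reduce step sums the numbers that share a label.

```r
# Sketch only: call this after library(rmr2).
# `sum_by_parity` is a hypothetical helper name.
sum_by_parity <- function(x) {
  input <- to.dfs(x)                              # write the vector to the file system
  result <- mapreduce(
    input  = input,
    map    = function(k, v) keyval(v %% 2, v),     # key: 0 for even, 1 for odd
    reduce = function(k, vv) keyval(k, sum(vv))    # sum all values sharing a key
  )
  from.dfs(result)                                # read the key-value pairs back into R
}

# Example call, using the local backend so that no cluster is needed:
# library(rmr2)
# rmr.options(backend = "local")
# out <- sum_by_parity(1:10)
# keys(out)    # the parities
# values(out)  # the sum for each parity
```

Switching rmr.options(backend = "hadoop") runs the same function on the cluster, which is the main reason to try the logic on the local backend first.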

plyrmr

This package is used for basic data-manipulation needs.

data manipulation
- bind.cols (add new columns),
- where (select rows),
- select (select columns),
- rbind, transmute (all of the above plus summaries)
summaries
- transmute,
- sample,
- count.cols,
- quantile.cols,
- top.k,
- bottom.k
set operations
- union,
- intersect,
- unique,
- merge

transmute appears twice because it is a generalization over transform and summarize that allows us to increase or decrease the number of columns or rows, covering the need for multi-row summaries, flattening of data structures, etc.
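As a rough sketch of how these verbs combine, the function below adds a column, filters rows and then uses transmute as a per-group summary. It is written from memory of the plyrmr tutorial, so treat the exact calls as assumptions to check against the package help. input, group and as.data.frame are plyrmr functions not listed above, and mean_mpg_by_cyl and the HDFS path are hypothetical.

```r
# Sketch only: call this after library(plyrmr).
# `mean_mpg_by_cyl` and `hdfs_path` are hypothetical names; the data set is
# assumed to have the columns of R's built-in mtcars.
mean_mpg_by_cyl <- function(hdfs_path) {
  cars    <- input(hdfs_path)                          # refer to a data set in HDFS
  ratio   <- bind.cols(cars, carb.per.cyl = carb / cyl) # add a new column
  subset  <- where(ratio, carb.per.cyl >= 0.5)          # select rows
  grouped <- group(subset, cyl)                        # group by number of cylinders
  summary <- transmute(grouped, mean.mpg = mean(mpg))  # one summary row per group
  as.data.frame(summary)                               # run the job and fetch the result
}

# Example call (assumes mtcars was first written to HDFS with rmr2's to.dfs):
# to.dfs(mtcars, output = "/tmp/mtcars")
# mean_mpg_by_cyl("/tmp/mtcars")
```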