Analytics at Scale: h2o, Apache Spark and R on AWS EMR

Mark Stephenson / Thursday, 21 June 2018 / Published in Analytics, Data Science, Machine Learning, Predictive Analytics, R, Tutorials

At Red Oak Strategic, we utilize a number of machine learning, AI and predictive analytics libraries, but one of our favorites is h2o.  Not only is it open-source, powerful and scalable, but there is a great community of fellow h2o users who have helped us over the years, and the leadership at the company is very responsive (shoutout to Erin LeDell and Arno Candel).

However, as we built models on the large datasets we regularly work with, our data science team began running into scale issues training and scoring locally, and we needed to move to a more distributed solution.  h2o's Sparkling Water, which runs the h2o algorithms on top of Apache Spark, was a perfect fit.  As an AWS Partner, we wanted to utilize the Amazon Web Services EMR solution, and as we built out the setup we decided to write a full end-to-end tutorial so that other h2o users in the community can benefit.  Shoutout as well to Rahul Pathak at AWS for his help with EMR over the years.

This post covers the following steps, taking you from no AWS infrastructure to a powerful, scaled and distributed predictive analytics and machine learning setup built on top of Apache Spark:

  1. Create an AWS EMR cluster (we assume you already have an AWS account)
  2. SSH and install h2o Sparkling Water and necessary dependencies
  3. Make some configuration tweaks
  4. Connect from your local machine in R/RStudio
  5. Model and score!
  6. Spin down your cluster

One note: we started with h2o's built-in instructions for spinning up an EMR cluster, but made a few tweaks for our own setup and to upgrade versions.

Create an AWS EMR Cluster

There are two methods for creating your EMR cluster: through the AWS CLI (command line tool) or via the Amazon Web Services UI.  We prefer the CLI for its flexibility, so we will use that method here.  If you'd like to use the UI, these instructions outline those steps.  Again, I am assuming your AWS CLI is configured and set up; if not, follow these instructions to do so.

Note that in the command below you will need to customize your KeyName, the subnet/security group information, and tags, as well as the size of your cluster (InstanceCount, which I've set to 6) and the InstanceType for both your head node and your workers.

Note: This assumes Spark 2.3.0, Sparkling Water 2.3.5, and h2o 3.18.10 are the versions you wish to install. To install other versions, change the commands as needed.

Things to review/update:

  • tags
  • ec2-attributes
    • KeyName
    • SecurityGroup / SubnetId
  • release-label (which version of the base software do you want?)
    • Note: emr-5.13 has Spark 2.3.0 installed
  • name
  • instance-groups
    • InstanceCount
    • InstanceType
  • region
  • profile (set to the profile name from your own ~/.aws/credentials file)
$ aws emr create-cluster \
--applications Name=Hadoop Name=SPARK Name=Zeppelin Name=Ganglia \
--no-visible-to-all-users \
--tags 'Owner=mark' 'Purpose=H2O Sparkling Water Deployment' 'Name=h2o-spark' \
--ec2-attributes '{"KeyName":"your_key_name_here","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-xxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxx","EmrManagedMasterSecurityGroup":"sg-xxxxxxx","AdditionalMasterSecurityGroups":["sg-xxxxxx"]}' \
--service-role EMR_DefaultRole \
--release-label emr-5.13.0 \
--name 'h2o-spark' \
--instance-groups '[{"InstanceCount":6,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":100,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"CORE","InstanceType":"c4.8xlarge"},{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":100,"VolumeType":"gp2"},"VolumesPerInstance":1}]},"InstanceGroupType":"MASTER","InstanceType":"c4.8xlarge"}]' \
--scale-down-behavior TERMINATE_AT_INSTANCE_HOUR \
--region us-east-2 \
--profile mark
After a few minutes, your cluster should be available and viewable, either through the EMR UI or from the command line.
$ aws emr list-clusters

SSH and Configure Cluster

Something we wanted to customize a bit was the shell script that runs after the cluster spins up.  Longer term, this could be automated as a bootstrap script, but for the purposes of this tutorial we're listing the full file.

First, SSH into the head node from your terminal (make sure the SSH port, 22, is open to your IP in the master node's security group):

$ ssh -i your_key_name_here.pem hadoop@ec2.your.master.node.ip.here

Run the following on the head node after SSHing.

set -x -e
 
sudo python -m pip install --upgrade pip
sudo python -m pip install --upgrade colorama
sudo ln -sf /usr/local/bin/pip2.7 /usr/bin/pip
 
sudo pip install -U requests
sudo pip install -U tabulate
sudo pip install -U future
sudo pip install -U six
 
#Scikit Learn on the nodes
sudo pip install -U scikit-learn
 
# Sparkling Water release branch and build: rel-2.3, build 5 = sparkling-water-2.3.5
version=2.3
h2oBuild=5
SparklingBranch=rel-${version}
 
mkdir -p /home/hadoop/h2o
cd /home/hadoop/h2o
 
echo -e "\n Installing sparkling water version $version build $h2oBuild "
# Download and unpack Sparkling Water (these must finish before the rename steps below)
wget https://h2o-release.s3.amazonaws.com/sparkling-water/${SparklingBranch}/${h2oBuild}/sparkling-water-${version}.${h2oBuild}.zip
unzip -o sparkling-water-${version}.${h2oBuild}.zip 1> /dev/null
 
export MASTER="yarn"
 
echo -e "\n Rename jar and Egg files"
mv /home/hadoop/h2o/sparkling-water-${version}.${h2oBuild}/assembly/build/libs/*.jar /home/hadoop/h2o/sparkling-water-${version}.${h2oBuild}/assembly/build/libs/sparkling-water-assembly-all.jar
mv /home/hadoop/h2o/sparkling-water-${version}.${h2oBuild}/py/build/dist/*.egg /home/hadoop/h2o/sparkling-water-${version}.${h2oBuild}/py/build/dist/pySparkling.egg
 
echo -e "\n Creating SPARKLING_HOME env ..."
export SPARKLING_HOME="/home/hadoop/h2o/sparkling-water-${version}.${h2oBuild}"
export PYTHON_EGG_CACHE="~/"
export SPARK_HOME="/usr/lib/spark"

Launch the Sparkling Water Cluster

Once your install script has finished successfully, navigate to the sparkling-water install directory:

$ cd sparkling-water-2.3.5

To launch the Spark cluster, run the following command, with any edits needed for the specific EMR setup you initialized (e.g. instance count, cores, memory):

$ bin/sparkling-shell \
--master yarn \
--conf spark.submit.deployMode=client \
--conf spark.executor.instances=6 \
--conf spark.executor.cores=36 \
--conf spark.executor.memory=40g \
--conf spark.driver.memory=40g \
--conf spark.scheduler.maxRegisteredResourcesWaitingTime=1000000 \
--conf spark.ext.h2o.fail.on.unsupported.spark.param=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.sql.autoBroadcastJoinThreshold=-1 \
--conf spark.locality.wait=30000 \
--conf spark.scheduler.minRegisteredResourcesRatio=1

Once you see the Spark/Scala prompt, run these commands to start h2o:

import org.apache.spark.h2o._
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext._

If you’ve configured everything correctly, your output will look like this (with different IP addresses for your instances – the head and the worker machines):

scala> import org.apache.spark.h2o._
import org.apache.spark.h2o._
 
scala> val h2oContext = H2OContext.getOrCreate(spark)
h2oContext: org.apache.spark.h2o.H2OContext =                                  
 
Sparkling Water Context:
* H2O name: sparkling-water-hadoop_application_1234567890_0001
* cluster size: 6
* list of used nodes:
  (executorId, host, port)
  ------------------------
  (1,ip-172-00-00-11.us-east-2.compute.internal,54321)
  (3,ip-172-00-00-12.us-east-2.compute.internal,54321)
  (4,ip-172-00-00-13.us-east-2.compute.internal,54321)
  (6,ip-172-00-00-14.us-east-2.compute.internal,54321)
  (5,ip-172-00-00-15.us-east-2.compute.internal,54321)
  (2,ip-172-00-00-16.us-east-2.compute.internal,54321)
  ------------------------
 
  Open H2O Flow in browser: http://ip-172-00-00-11.us-east-2.compute.internal:54321 (CMD + click in Mac OSX)
 
scala> import h2oContext._
import h2oContext._

The H2O Flow will be available at the public IP of the head node at port 54321. For example:

your.ec2.instance.public.ip:54321

Something to check: make sure port 54321 is open in your security groups, so you can connect from your IP.

Connect from your local RStudio

To utilize the cluster, we can now connect from RStudio.

library(tidyverse)
library(sparklyr)
library(h2o)
library(janitor)
options(rsparkling.sparklingwater.version = "2.3.5")
library(rsparkling)
h2o.init(ip = "ip address of master node", port = 54321)

Now, instead of utilizing the local h2o instance on your laptop or desktop, you can utilize the distributed cluster on AWS.
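
As a quick sanity check, you can confirm from R that you are attached to the remote cluster rather than a local instance; this is a minimal sketch using h2o's built-in status helper:

# After h2o.init(), verify the connection points at the EMR cluster:
# the node list should show the private IPs of your worker instances,
# and the cluster size should match your InstanceCount.
h2o.clusterInfo()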

Get Data Into Your Cluster

Something we wanted to add here: the speed of transferring files from your local machine to the cluster can significantly impact performance.  We recommend using the h2o.importFile function to read data directly from S3 (or another URL), which significantly speeds up that data transfer.

airlinesURL <- "https://s3.amazonaws.com/h2o-airlines-unpacked/allyears2k.csv"
airlines.hex <- h2o.importFile(path = airlinesURL, destination_frame = "airlines.hex")
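
With data in the cluster, you can complete step 5 and model and score.  The snippet below is a minimal sketch against the airlines frame imported above; the IsDepDelayed response, the predictor list and the GBM are illustrative choices, not the only ones that work:

# Split the frame on the cluster into training and validation sets
splits <- h2o.splitFrame(airlines.hex, ratios = 0.8, seed = 1234)
train <- splits[[1]]
valid <- splits[[2]]

# Train a GBM to predict departure delays; the work runs on the EMR nodes
gbm_model <- h2o.gbm(x = c("Origin", "Dest", "Distance", "UniqueCarrier", "Month", "DayOfWeek"),
                     y = "IsDepDelayed",
                     training_frame = train,
                     validation_frame = valid)

# Score: predictions are also computed on the cluster
preds <- h2o.predict(gbm_model, newdata = valid)
head(preds)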

Monitor Your Cluster

h2o includes a very nice cluster monitoring tool called the "Water Meter," available in the H2O Flow UI, which allows you to view and monitor CPU usage across the cluster.  On our 8-node cluster, it was a wall of beautiful, busy CPUs!
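
If you'd rather monitor from R than from the browser, a rough text-based equivalent (not a replacement for the Water Meter's live view) is to poll the cluster status:

# One row per node: health, number of cores, free memory, and more
h2o.clusterStatus()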

Terminate Your Cluster

Once you’ve finished your work, and you want to spin down or terminate your cluster, use the following command from your local machine (in your Terminal):

$ aws emr terminate-clusters --cluster-ids j-3KVXXXXXXX7UG

Wrapping Up

Again, we want to thank the h2o community as well as h2o and Amazon Web Services staff for their assistance and guidance.  There are almost unlimited possibilities for this implementation, but we’d love to hear from you about ways you’re using Sparkling Water or h2o in your work.  Or, if you have questions/comments for the Red Oak Strategic team, please leave them below and we’ll get back in touch.  Thanks for reading.

Tagged under: Apache Spark, Code, Data Science, h2o, Predictive Analytics, R, Sparkling Water, Tutorials
