Installing Bigstream on AWS EMR

This document has 3 sections:
1- Supported Configuration
2- Deployment Steps
3- Example Run- Try It!


1- Supported Configuration

                  Supported in this Release
  Spark Release   Release 2.0
  Storage         HDFS, S3, Blob, Kafka
  Data types      AVRO, CSV, JSON (gzip and LZO compression are not supported in this release)
  Operators       Dataframes/SQL


2- Deployment Steps

Deploying Bigstream Hyper-acceleration software on your AWS EMR solution is straightforward.

If you have not already subscribed to our software, please go to the AWS Marketplace product page URL listed below and follow the steps until your registration is complete.

Once your subscription and registration are complete, you will be given a URL to use for deploying the software when you provision a new EMR cluster. You can reuse the same URL every time you provision a new cluster. If you do not add the URL when provisioning a new EMR cluster, the Bigstream software will not be installed.

STEP 1- Start the EMR provisioning tool in the AWS console.

STEP 2- After clicking "Create Cluster", select "Go to advanced options".

STEP 3- Set up your Software Configuration as usual.

STEP 4- Set up your Hardware Configuration and click "Next".

STEP 5- In the "Additional Options" field, select "Bootstrap Actions".

STEP 6- Then select "Custom Actions".

STEP 7- Click the "Configure and add" button.

STEP 8- Paste the deployment URL that was emailed to you into the "Script Location" field and click "Add".

STEP 9- Click the "Next" button and you are done. Bigstream will be provisioned on your cluster automatically and should start accelerating your programs.
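If you provision clusters from the command line rather than the console, the same bootstrap action can be attached with the AWS CLI. This is only a sketch: the cluster name, instance settings, and script location below are placeholders, not values supplied by Bigstream. Substitute the deployment URL from your registration email as the bootstrap action path.

```shell
# Sketch: create an EMR cluster with the Bigstream bootstrap action attached.
# All values below are illustrative placeholders; in particular, replace the
# --bootstrap-actions Path with the deployment URL you received by email.
aws emr create-cluster \
  --name "bigstream-cluster" \
  --release-label emr-5.0.0 \
  --applications Name=Spark \
  --instance-type m4.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path="s3://your-bucket/your-bigstream-bootstrap.sh"
```

Note that EMR requires bootstrap action scripts to be stored in Amazon S3.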

3- Example Run- Try It!

After you have started your cluster with Bigstream software, you can run an example SQL query to test out Bigstream acceleration.

You should see the directory /home/hadoop/example/ on your cluster. To run the accelerated example, issue the following spark-submit command:

spark-submit --master yarn --jars /home/hadoop/spark-avro_2.11-3.0.0.jar --class co.bigstream.benchmark.TPCSQ69 /home/hadoop/example/tpcds-avro_2.11-1.0.jar 2 s3a://mytpcds100g/pdata100SF

The last two parameters are, respectively, the number of times the query will be run and the location of the input data on S3.

The code will run accelerated by default on your Bigstream Hyper-accelerated cluster, and will print output in tabular form as well as runtime for each iteration marked by the string "End-to-End Time" (in milliseconds).
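To pull just the per-iteration runtimes out of the driver output, you can filter on that marker string. A minimal sketch, assuming you redirected the spark-submit output to a file (the name run.log is a placeholder) and that each runtime appears on a line containing "End-to-End Time" followed by the millisecond value:

```shell
# Extract the millisecond values after "End-to-End Time" and average them
# across iterations. Assumes the runtime is the only number on each such line.
grep "End-to-End Time" run.log \
  | grep -o '[0-9]\+' \
  | awk '{ sum += $1; n++ } END { if (n) printf "avg: %.1f ms over %d runs\n", sum / n, n }'
```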

For comparison, you can run the query unaccelerated (i.e. with standard Spark) as follows:

spark-submit --master yarn --jars /home/hadoop/spark-avro_2.11-3.0.0.jar --conf spark.bigstream.accelerate=false --class co.bigstream.benchmark.TPCSQ69 /home/hadoop/example/tpcds-avro_2.11-1.0.jar 2 s3a://mytpcds100g/pdata100SF

(That is, add the configuration flag --conf spark.bigstream.accelerate=false.)

NOTE: The test can take up to 10 minutes in unaccelerated mode. Compare the End-to-End runtimes of the two runs to see the level of acceleration on your cluster.
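Speedup is simply the ratio of the two End-to-End times. As a sketch, with hypothetical runtimes plugged in (the millisecond values here are illustrative, not measurements):

```shell
# Compute speedup = unaccelerated time / accelerated time.
# The two values are placeholders; substitute your measured runtimes.
unaccel_ms=480000
accel_ms=120000
awk -v u="$unaccel_ms" -v a="$accel_ms" 'BEGIN { printf "speedup: %.1fx\n", u / a }'
```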
