
Using Spark with Scala in Jupyter Notebooks

Getting started with Apache Spark can be a little tricky, especially for someone who doesn't know Scala and works on Windows. The road to getting it working looks more or less like this: install Java, install IntelliJ, learn how to deal with a dependency manager, maybe hit a wall because you set your PATH to the wrong folder, and probably end up with a full Spark installation you don't really need.

We have a similar problem with Python: there are a lot of pitfalls you can stumble into, and then you spend 80% of your time setting up an environment and only 20% learning the new technology you are so excited about. Luckily, there is a solution to this problem: mybinder.org.

Binder uses JupyterHub together with containers to quickly spin up a fresh Jupyter Notebook, ready to rock with all dependencies in place. You can jump straight to the code without dealing with operational problems. This is extremely useful when you lead a workshop for relatively inexperienced people: no need to worry about setting everything up, no hidden traps that will eat your precious time during the workshop, straight to the learning.

Ok, wait a minute, how could Binder be useful for teaching Apache Spark? Does that mean we are limited to PySpark? Absolutely not! And that's where Almond.sh joins the game.

Almond is a Scala kernel that lets us use Scala inside Jupyter Notebooks. Among its cool features, we have Ammonite support and also ✨Spark support✨. Bingo! You can get started with Almond from the examples repository.

If you go straight to spark.ipynb, you will notice a couple of things:

  1. Coursier
  2. Imports with $ivy
  3. NotebookSparkSession, a special SparkSession that gives you a nice progress bar for the jobs running from your Jupyter notebook (put together in the sketch below)
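
Put together, the top of spark.ipynb looks roughly like this. It's a minimal sketch following the example in the Almond docs; the library versions are illustrative, so pick ones matching your kernel's Scala version:

```scala
// Spark is pulled into the notebook through Ammonite's $ivy imports,
// which Coursier resolves behind the scenes.
import $ivy.`org.apache.spark::spark-sql:2.4.0`
import $ivy.`sh.almond::almond-spark:0.10.9` // added automatically by recent Almond versions

import org.apache.spark.sql._

// NotebookSparkSession wraps the usual SparkSession builder and renders
// a progress bar for running jobs right in the notebook output.
val spark = NotebookSparkSession.builder()
  .master("local[*]")
  .getOrCreate()
```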

Almond + Binder

The next step for our reproducible Spark notebooks is to check whether this works with Binder. And it does! You can run a standalone Almond Binder notebook straight from this link. Additionally, connecting to a remote cluster is an option as well. With a remote cluster, we don't have to worry about the Spark UI: we can view it from the cluster address.
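
For the remote case, the only real change is the master URL on the builder. A minimal sketch, assuming a standalone cluster; the host, port, and memory setting below are hypothetical placeholders, not values from this post:

```scala
// Point the notebook session at a remote standalone master instead of local[*].
// spark://cluster-host:7077 is a made-up address; use your cluster's own.
val spark = NotebookSparkSession.builder()
  .master("spark://cluster-host:7077")
  .config("spark.executor.memory", "2g")
  .getOrCreate()
```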

Running a Spark job on Binder works too. However, there is one downside: when we try to access the Spark UI created from Binder, the link does not work. But we can work around this and, along the way, get a more powerful UI for our jobs.

Delight

The way to work around this last issue is to use Delight. It is a service provided by Data Mechanics that gives you a more informative Spark UI. To use it, you create an account on datamechanics.co, delegate UI forwarding to Delight through your spark-submit (or session) configuration, and you are set! What's more, you still have access to the regular Spark UI, so it doubles as a fast-to-set-up proxy that resolves our problem with the blocked UI in Binder.
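
In a notebook, the spark-submit flags translate into an extra dependency plus builder configs. A hedged sketch based on Delight's published setup; the resolver, artifact coordinates, listener class, and config keys may have changed since, so double-check Delight's README:

```scala
// Delight builds are published to the Sonatype snapshots repository, so
// register that resolver first. Run this in its own cell, before the
// $ivy import below, so the import can resolve against it.
interp.repositories() ++= Seq(
  coursierapi.MavenRepository.of("https://oss.sonatype.org/content/repositories/snapshots")
)

import $ivy.`co.datamechanics:delight_2.12:latest-SNAPSHOT`

val spark = NotebookSparkSession.builder()
  .master("local[*]")
  // Access token generated in your datamechanics.co account.
  .config("spark.delight.accessToken.secret", "<your-token>")
  // The listener streams Spark events to Delight's backend.
  .config("spark.extraListeners", "co.datamechanics.delight.DelightListener")
  .getOrCreate()
```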

In order to see finished jobs in the Spark UI from our notebook, you have to stop the Spark session with the classic spark.stop() method:
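
```scala
// Stopping the session flushes the remaining events; the application then
// shows up as completed in the UI.
spark.stop()
```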

Here we are!

After working around some issues, we achieved a fully featured, portable Spark notebook with Scala and a working Spark user interface. The best part is that you don't have to remember everything and set up every property manually. With Binder, you get a prebuilt Docker container for your repository. So at the end of the day, you get just one convenient link to a working notebook:

https://mybinder.org/v2/gh/co0lster/examples-1/HEAD