Chris Caspanello, an avid Spark developer, demonstrates how you can use OverOps to find errors in your Spark app. As Chris states, “Configuration issues, formatting issues, data issues, and outdated code can all wreak havoc on Spark jobs. In addition, operational challenges, whether it’s cluster size or access to debug production issues, make things difficult. OverOps’ ability to detect exactly why something broke, and to see the variable state, is invaluable in a distributed computing environment. It makes it possible to identify and resolve critical anomalies quickly and easily.”
If you are a Spark developer and are having the above or similar issues, OverOps can be a game changer. Try OverOps for free for 14 days now.
GitHub files: https://github.com/ccaspanello/overops-spark-blog
Data path not properly defined
When developing transformations, most customers will read files from HDFS, using a URL like `hdfs://mycluster/path/to/data`. However, some clients will copy files locally to the nodes and use a URL like `file://path/to/data`. Unfortunately, that is not valid. The URL format is `[scheme]://[host]/[path]`. If you omit the host, you need `file:///path/to/data`, with three forward slashes. When the Spark job is submitted to the cluster with the malformed URL, the job dies a horrible death with little to no indication of what happened. This was fixed with up-front path validation, but finding the root cause was not easy and took a long time (more on that below). If only I had OverOps, I could have quickly and easily understood why it broke. I could have looked at the Continuous Reliability console and seen where the error occurred and what the error was, along with the variable state that went into the function.
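That kind of up-front path validation can be sketched with Python's standard `urllib.parse`. This is an illustrative helper, not OverOps or Spark API; the function name and the accepted schemes are assumptions:

```python
from urllib.parse import urlparse

def validate_data_path(url: str) -> str:
    """Fail fast on malformed data URLs before submitting the Spark job."""
    parsed = urlparse(url)
    if parsed.scheme not in ("hdfs", "file"):
        raise ValueError(f"Unsupported scheme in {url!r}")
    if parsed.scheme == "file" and parsed.netloc:
        # 'file://path/to/data' parses 'path' as a host, not part of the path;
        # local paths need three slashes: 'file:///path/to/data'
        raise ValueError(
            f"Malformed file URL {url!r}: use 'file:///path/to/data'"
        )
    return url

validate_data_path("hdfs://mycluster/path/to/data")  # valid: host + path
validate_data_path("file:///path/to/data")           # valid: empty host
```

Calling the validator before `spark-submit` turns a silent cluster-side failure into an immediate, readable error on the driver.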
The Spark user interface is great... when it is running
In the previous example I mentioned that finding the root cause of a failed Spark job is not easy and is time consuming. This is because of how Spark’s user interface works. The Spark user interface consists of several parts: the Master, Worker, and Job screens. The Master and Worker screens run all the time and contain details and statistics for each service. While a Spark job is running, the Job screens are available and look like this:
Here you can see which stages are running and get logs for running or failed stages. These logs can be useful for finding failures. Unfortunately, when the job finishes or fails, the service dies and you can no longer access the logs through the web user interface. Since OverOps captured the event, I could see it there along with the full variable state.
Missing headers in part files
In this example, I wrote the Spark app and tested it locally on a sample file. Everything worked fine. However, when I then ran the job on my cluster against a real data set, it failed with the following error:
IllegalArgumentException: 'market does not exist. Available: US, 16132861, ...'
As you can see, the exception included row data, and that’s good. But that alone is not enough to tell us what’s going on. Since OverOps captures the variable state at the moment the exception occurs, I was able to see that the schema was actually empty. The root cause was that I did not have a header row in each part file.
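A pre-flight check on part files could have caught this before the job ran. The sketch below is hypothetical: the `market` column is inferred from the error message above, and the `has_header` helper is an assumption, not code from the actual app:

```python
import csv
import io

# Hypothetical expected schema, inferred from the error message above
EXPECTED_COLUMNS = ["market", "count"]

def has_header(first_line: str, expected, delimiter=","):
    """Heuristic: the first line is a header only if it matches the expected column names."""
    fields = next(csv.reader(io.StringIO(first_line), delimiter=delimiter))
    return [f.strip() for f in fields] == expected

# A part file whose first line is data, not a header:
has_header("US,16132861", EXPECTED_COLUMNS)   # False -> missing header row
# A part file with a proper header row:
has_header("market,count", EXPECTED_COLUMNS)  # True
```

Running a check like this over the first line of every part file flags the files that would otherwise produce an empty schema at read time.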
In this next example, I hit an error similar to the one above:
IllegalArgumentException: 'id does not exist. Available: id, first_name, last_name, email, gender, ip_address'
But in this case my files did have a column header. So what happened? Looking at the variable state in OverOps, I could see that my schema had a single column named `id,first_name,last_name,email,gender,ip_address`. That tells me my delimiter is wrong.
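The effect is easy to reproduce with Python's standard `csv` module: parse the same header line with the wrong and the right delimiter. The column names come from the error message above; the snippet itself is illustrative, not the actual job code:

```python
import csv
import io

header_line = "id,first_name,last_name,email,gender,ip_address"

# Parsing with the wrong delimiter collapses the header into one column,
# whose name is the entire header string:
wrong = next(csv.reader(io.StringIO(header_line), delimiter=";"))
assert wrong == ["id,first_name,last_name,email,gender,ip_address"]

# Parsing with the correct delimiter yields the six expected columns:
right = next(csv.reader(io.StringIO(header_line), delimiter=","))
assert right == ["id", "first_name", "last_name", "email", "gender", "ip_address"]
```

That single mega-column is exactly what the variable state in OverOps showed, which is what pointed to the delimiter rather than the data.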
Scaling with unknown data
Big data developers often test against a small subset of data using `.limit(200)`. But what happens when unexpected data enters the system? Do you crash the app? Or do you swallow the error and move on? This is always a hot topic, but in either case OverOps can pinpoint the exact place where the data could not be parsed.
In this example, the original app was coded to expect a male/female gender value. We now see a new, valid gender value. In our scenario, we need to update our app to accept POLYGENDER as well.
Apart from coding problems, there are also operational challenges:
- In a large Spark cluster with 100 nodes, tracking down the right log for a given job is a very difficult task. This is where the Spark History Server comes in handy, but sometimes cluster administrators lock it down and the developer may not even have access.
- OverOps gives us a central place to go for any error that occurs.
- Running jobs on massive data sets can take hours (so sometimes failing records are ignored or redirected).
- OverOps can detect anomalies as they occur, so we can kill the job sooner and fix the code.
- This is a double cost savings: less developer time and fewer cloud resources spent running a bad job.
- Sometimes logging is turned off to speed things up or save resources.
- Even with logging turned off in the app, OverOps can still capture log events and exceptions.
As you can see, configuration issues, formatting issues, data issues, and outdated code can all wreak havoc on Spark jobs. In addition, operational challenges, whether cluster size or access, make it difficult to debug production issues. OverOps’ ability to detect exactly why something broke and to see the variable state is invaluable in a distributed computing environment. It makes it possible to identify and resolve critical exceptions quickly and easily.
Try OverOps free for 14 days
So if you are developing Spark and are having the above issues or similar issues, you may want to try OverOps. Get started now for free.