IBM Support

Spark SQL vs. Big SQL Performance

Technical Blog Post


Abstract

Spark SQL vs. Big SQL Performance

Body

Last month, we provided an update on Big SQL vs. Hive performance tests running the Hadoop-DS benchmark. Hive is based on MapReduce and Java, while Big SQL uses a native C/C++ MPP engine, so it's not surprising that Big SQL was 20X faster on average.

This month, we've performed a similar test against Spark SQL. It is commonly said that Spark is 10X to 100X faster than MapReduce. How will Big SQL compare?

Both Spark SQL and Big SQL leverage the Hive metastore and storage model. The major difference between the two (from a SQL perspective) is the optimizer and execution engine.

Preserving the Hive metastore and its storage model is critically important for preserving the openness of your data. Hive is the de facto standard for SQL on Hadoop, as it is included in every commercial Hadoop distribution. Big SQL preserves your data in Hive format, so if you fall out of love with Big SQL, you can always uninstall it and use something else. Your data remains compatible with Hive and with other SQL engines that support the same standard, such as Spark SQL.

For this test, we compared the latest version of Spark 1.5.0 and Big SQL V4.1, both running on the IBM Open Platform. (We also tested Spark 1.5.1, but it ran 7% slower than 1.5.0, so we're sharing Spark's best result.) Two 20-node clusters were set up with the same specifications on SoftLayer using bare metal servers configured according to IBM's reference architecture.
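Figures such as "20X faster" and "7% slower" are easiest to compare when it is clear how they are derived from elapsed times. The sketch below shows one common convention, assuming both figures are ratios of total elapsed time; the post does not state the exact formula used, and the function names and timings here are hypothetical placeholders, not benchmark results.

```python
def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline.

    Assumes "N times faster" means baseline elapsed time / candidate elapsed time.
    """
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return baseline_seconds / candidate_seconds


def percent_slower(reference_seconds: float, other_seconds: float) -> float:
    """Return how much longer `other` took than `reference`, in percent.

    Assumes "P% slower" means elapsed time grew by P percent.
    """
    if reference_seconds <= 0 or other_seconds <= 0:
        raise ValueError("elapsed times must be positive")
    return (other_seconds / reference_seconds - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical elapsed times in seconds, chosen only to illustrate the math.
    run_a = 1000.0  # stands in for the faster of two runs
    run_b = 1070.0  # stands in for a run that took 7% longer
    print(f"run_b is {percent_slower(run_a, run_b):.1f}% slower than run_a")
    print(f"run_a is {speedup(run_b, run_a):.2f}X faster than run_b")
```

Under this convention a small percentage difference and a large multiple are the same kind of measurement: a 7% slowdown corresponds to a 1.07X ratio, while a 20X speedup means the slower engine took twenty times as long.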


UID

ibm16259977