IBM has announced a series of investments into the Apache Spark project that could turn it into a big data wildfire.
IBM announcements usually start by citing the billions of dollars the company is investing in a technology. This time IBM has not put a monetary value on the investment, but it has announced that more than 3,500 IBM researchers and developers will work on Spark-related projects across a dozen labs worldwide.
If this investment in personnel wasn’t enough, IBM has said in the press release that it intends to educate more than 1 million data scientists and engineers on Spark. This will be carried out through its partnerships with training companies and the availability of Massive Open Online Courses (MOOCs).
What does IBM plan to do with Spark?
IBM is heavily focused on big data. It has spent big on acquiring companies and technology as well as building out its own products in this area. At IBM Edge in Las Vegas a few weeks ago, new announcements around storage focused on software-defined storage. Add to that Watson and the Internet of Things, and IBM is keen to be seen as the apex big data player.
Including the commitment of its own staff and the training of 1 million data professionals, the press release identifies six things IBM intends to do with Spark:
- IBM will build Spark into the core of the company’s analytics and commerce platforms.
- IBM’s Watson Health Cloud will leverage Spark as a key underpinning for its insight platform, helping to deliver faster time to value for medical providers and researchers as they access new analytics around population health data.
- IBM will open source its breakthrough IBM SystemML machine learning technology and collaborate with Databricks to advance Spark’s machine learning capabilities.
- IBM will offer Spark as a Cloud service on IBM Bluemix to make it possible for app developers to quickly load data, model it, and derive the predictive artifact to use in their app.
- IBM will commit more than 3,500 researchers and developers to work on Spark-related projects at more than a dozen labs worldwide, and open a Spark Technology Center in San Francisco for the Data Science and Developer community to foster design-led innovation in intelligent applications.
- IBM will educate more than 1 million data scientists and data engineers on Spark through extensive partnerships with AMPLab, DataCamp, MetiStream, Galvanize and Big Data University MOOC.
According to Beth Smith, General Manager, Analytics Platform, IBM Analytics: “We believe strongly in the power of open source as the basis to build value for clients, and are fully committed to Spark as a foundational technology platform for accelerating innovation and driving analytics across every business in a fundamental way.”
Why is Spark different from other Open Source solutions?
One of the key benefits of Spark is that it builds on existing big data technologies such as Hadoop and supports a wide range of data sources. Spark is described as a big data processing framework with a sophisticated set of analytics. It is not the only big data framework available and compared to Hadoop and MapReduce, Spark has been the forgotten project in this field.
What is different here is that Spark is not a sequential processing solution like Hadoop MapReduce. With those technologies, each query has to complete before the next query can start. This locks large amounts of data, preventing it from being used elsewhere unless the user creates their own subset, which is expensive in terms of both machine time and data.
Spark instead uses an execution model based on a Directed Acyclic Graph (DAG). This allows it to run multi-step data pipelines. What makes this interesting is that Spark allows multiple DAGs to share in-memory data, which means users can run several queries against the same data at the same time.
The use of in-memory storage also extends to how queries are managed. Rather than writing results to disk, Spark can continue to hold them in memory, especially if they are going to be used in other queries. With the amount of addressable memory in servers continuing to increase, ever more complex queries can be executed in memory.
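The difference this makes can be sketched in a few lines of plain Python. The snippet below is a toy illustration of the idea behind lazy, DAG-based evaluation with a shared in-memory cache; it is not Spark's actual API, and all the names in it are invented for the example:

```python
# Toy illustration of DAG-style lazy evaluation with a shared in-memory
# cache, mimicking the idea behind Spark's model (NOT Spark's real API).

class Node:
    """A node in a directed acyclic graph of transformations."""
    def __init__(self, fn, parent=None):
        self.fn = fn          # transformation to apply
        self.parent = parent  # upstream node, or None for a data source
        self._cache = None    # populated only if .cache() was requested
        self._cached = False

    def cache(self):
        # Mark this node so its result stays in memory after first use.
        self._cached = True
        return self

    def map(self, fn):
        # Building the graph is cheap: nothing executes yet (lazy evaluation).
        return Node(fn, parent=self)

    def collect(self):
        # Execution walks the DAG; a cached node is computed only once,
        # so several downstream queries can share the same in-memory data.
        if self._cached and self._cache is not None:
            return self._cache
        data = self.fn(self.parent.collect() if self.parent else None)
        if self._cached:
            self._cache = data
        return data

# One "dataset" held in memory; two independent queries share it.
source = Node(lambda _: list(range(10))).cache()
evens = source.map(lambda d: [x for x in d if x % 2 == 0]).collect()
total = source.map(lambda d: sum(d)).collect()
print(evens)  # [0, 2, 4, 6, 8]
print(total)  # 45
```

In Spark itself the equivalent pattern is calling `cache()` (or `persist()`) on an RDD or DataFrame and then running multiple actions against it, rather than re-reading the data from disk for each query.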
How much impact will IBM have on Spark?
What will be interesting here is how much effort IBM pours into the Apache Spark project. It is far from the biggest contributor at the moment but that should change with this announcement. Many in the Spark ecosystem will be wondering what impact this will have on them.
It is not just development resources that are interesting here. IBM’s intention to offer Spark as a Service on Bluemix will bring it to a lot more developers. Another thing developers will look for is the ability to take advantage of IBM’s Coherent Accelerator Processor Interface (CAPI) and flash memory. Using this approach, Spark could push data to flash memory, which is much faster than disk and only a little slower than RAM. The result could be tens of terabytes of data being held in-memory.
Where this is likely to have a bigger impact will be the Internet of Things (IoT), security analytics and projects such as the Square Kilometer Array. One of the challenges of these projects is the vast amount of data that they generate. Much of that data is noise rather than something that delivers value. This is why IBM has been focusing on dealing with it at the point of acquisition.
The Watson Health Cloud is an example of the challenge of huge volumes of data as part of an IoT solution. Phones, wearables and other mobile devices are beginning to capture increasing amounts of user data. Combining that data with information from medical devices and then using it in research requires a lot of complex processing, a significant amount of which may have to happen in real-time. This plays heavily to the ability of Spark to manage complex queries.
With the new micro-servers IBM is developing in Zurich, it will be able to deploy advanced processing right at the point of acquisition. It could also embed Spark into those System on Chip (SoC) solutions, allowing complex analytics to be carried out on data as it is acquired. This would reduce data storage requirements, reduce the network bandwidth consumed by moving data around the enterprise, and deliver significantly higher value to customers.
While the micro-servers that IBM Zurich is developing are a few years out, IBM is already beginning to use containers to move applications to the data. It would be reasonable to expect IBM to release a Docker container with Spark inside in the not too distant future.
Perhaps the biggest impact of IBM’s involvement is likely to be felt by the enterprise big data teams and MapReduce community. Spark has failed to ignite much interest inside the enterprise to date but with IBM’s backing, it is not unreasonable to see it replacing MapReduce as the preferred option for complex analytics.
Spark is yet another Open Source project around which IBM is wrapping its development might. The intention to offer Spark as a Service and combine Spark with the Watson Health Cloud are likely to get a lot of initial attention.
However, for security teams trying to understand in real-time the risk to their environment and to track attacks as they happen, Spark could be a step change in their tooling. Surprisingly, IBM has not talked about Spark in this context.