Friday 22 May 2015

Differences between batch processing and stream processing systems

To me a stream processing system:
  • Computes a function of one data element, or a smallish window of recent data (see the sketch after this list)
  • Computes something relatively simple
  • Needs to complete each computation in near-real-time -- probably seconds at most
  • Computations are generally independent
  • Asynchronous -- the source of the data doesn't interact with the stream processing directly, such as by waiting for an answer
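
To make the first two points concrete, here is a minimal sketch (in Python, with made-up data and a made-up window size) of that shape of computation: a simple function -- a rolling average -- over a smallish window of recent elements, producing a result as each element arrives:

    from collections import deque

    WINDOW_SIZE = 5  # hypothetical "smallish window of recent data"

    def rolling_average(events):
        """Consume events one at a time, emitting a simple function
        (the mean) of the most recent WINDOW_SIZE elements."""
        window = deque(maxlen=WINDOW_SIZE)  # older elements fall off automatically
        for value in events:
            window.append(value)
            yield sum(window) / len(window)  # cheap, per-element result

    # The source just hands over elements; it never waits on a final answer.
    for avg in rolling_average([3, 5, 4, 8, 2, 9]):
        print(avg)

The source here is a plain list for brevity; in a real system it would be an unbounded feed, and the consumer would keep up with it in near-real-time.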

A batch processing system to me is just the general case, rather than a special type of processing, but I suppose you could say that a batch processing system:
  • Has access to all data (see the sketch after this list)
  • Might compute something big and complex
  • Is generally more concerned with throughput than latency of individual components of the computation
  • Has latency measured in minutes or more
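
By way of contrast, here is the equally minimal batch version, again a sketch with hypothetical data: it reads the whole data set before computing anything, and only total run time matters, not the latency of any individual record:

    # Pretend a run of collected data already sits on disk.
    with open("measurements.txt", "w") as f:
        f.write("\n".join(["3", "5", "4", "8", "2", "9"]))

    def batch_average(path):
        """Load the entire data set, then compute one aggregate over it."""
        with open(path) as f:
            values = [float(line) for line in f]  # access to all the data at once
        return sum(values) / len(values)

    # Typically run unattended, e.g. as a nightly job.
    print(batch_average("measurements.txt"))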

I sometimes hear streaming used as a sort of synonym for real-time. Real-time stuff usually takes the form of needing to respond to an event in milliseconds, as in a synchronous API. This isn't streaming to me.

Batch processing is very efficient for high-volume data: data is collected, entered into the system, processed, and then results are produced in batches, so the time taken by the processing itself is not a big concern. Batch jobs are configured to run without manual intervention, working through the entire data set at scale to produce output in the form of computational analyses and data files. Depending on the size of the data being processed and the computational power of the system, output can be delayed significantly.

In contrast, stream processing involves continual input and output of data; it emphasizes the velocity of the data. Data must be processed within a small time window, or near real time (keep in mind that real-time systems and stream processing systems are different concepts that are sometimes used interchangeably; for details, see the question "What's the difference between real-time processing and stream processing?"). Stream processing gives decision makers the ability to adjust to contingencies based on events and trends developing in real time.
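
As a sketch of that continual-input, continual-output shape -- assuming a made-up sensor feed and threshold -- each reading flows through and a decision comes out per element, rather than the whole set being collected first:

    import time

    def sensor_feed():
        """Hypothetical unbounded source: readings arrive over time."""
        for reading in [21.5, 22.0, 95.0, 22.3]:
            yield reading
            time.sleep(0.1)  # stand-in for data arriving continuously

    def alerts(readings, threshold=90.0):
        """Make a per-element decision as data flows through; this is
        the velocity side of the trade-off."""
        for r in readings:
            if r > threshold:
                yield f"alert: reading {r} exceeds {threshold}"

    for line in alerts(sensor_feed()):
        print(line)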

In terms of the 3Vs of Big Data (Volume, Velocity, Variety), how I understand this is: if your only concern is the volume of the data, then batch processing is the way to go; if you also have to take the velocity of the data into account (continuous data) and you need results within specific time limits (seconds or so), then stream processing engines are there to help you.

(parts adapted from internet sources)