Dempsy

Distributed Elastic Message Processing System

Project 

Dempsy Project Sponsors
YourKit is kindly supporting open source projects with its full-featured Java Profiler. YourKit, LLC is the creator of innovative and intelligent tools for profiling Java and .NET applications. Take a look at YourKit's leading software products: YourKit Java Profiler and YourKit .NET Profiler.

Welcome to the Dempsy project - a framework for the easy implementation of stream-based real-time fully-distributed elastic analytics processing applications.

Features

If you're already familiar with real-time stream based BigData engines, the following list of features will distinguish Dempsy from the others:

  • Fine grained "actor model": Dempsy provides for the fine grained distribution and lifecycle management of (potentially) millions of "actors" (message processors in Dempsy parlance) across a large cluster of machines allowing developers to write code that concentrates on handling individual data points in a stream without any concern for concurrency.
  • Inversion of control programming paradigm: Dempsy allows developers to construct these large-scale processing applications decoupled from all infrastructure concerns providing a means for simple and testable POJO implementations.
  • Dynamic Topology Autoconfiguration: Dempsy doesn't require the prior definition of a topology for the stream processing. Not only does it dynamically discover the topology, but an application's topology can morph at runtime. Therefore, there's no complicated topology defining DSL or configuration necessary.
  • Integrates with available IAAS and PAAS tools: Dempsy doesn't pretend to be an application server or a batch processing system and therefore the provisioning, deployment and management of nodes in an application are provided by cloud management tools already part of your DevOps infrastructure. It fully integrates with various monitoring solutions including: JMX, Graphite, and Ganglia.
  • Fully elastic: Building on Dempsy's cooperation with existing IAAS/PAAS tools, elasticity allows nodes of the application to be provisioned and deprovisioned dynamically at runtime. This makes it possible to optimize the use of computational resources over time (currently available in the trunk revision).
  • Reliable/guaranteed message delivery: Dempsy provides a simple means of handling reliable message processing with at-least-once message handling (coming in April, 2013).

Documentation

    See the Dempsy User Guide.

    Since we are updating the documentation frequently, we recommend that you watch the Documentation issue (at the bottom of the issue page, click on "Enable notifications for this issue").

    If you prefer, you can go straight to the initial tutorial, "Simple Example."

    The Dempsy Examples repository has several versions of the WordCount example. When reading through the "Simple Example" you'll want to be viewing the code in the userguide-wordcount sub-project.

    Dempsy recognizes three types of developers, and the Api documentation is organized accordingly.

    Developer | Description | Api Documentation
    Application Developer | Developers who are going to build real-time, stream-based analytics fall into this category. | Application Developer Api Documentation
    Framework Developer | Dempsy is built on a set of abstractions that allow it to be extended with new transports, routing strategies, and monitoring techniques, among others. Developers interested in adding new implementations or techniques fall into this category. This Api extends the Application Developer Api above. | Framework Developer Api Documentation
    Dempsy Contributors | Any developer changing the existing framework or any of the default implementations of the core abstractions will need the complete Api documentation for the entire codebase. | The Complete Api Documentation

    This Api documentation is currently for the released 0.9 revision.

    Download

    Dempsy is currently released as a set of jar files. Version 0.9 is now in the Maven Central Repository, so you can build against released Dempsy code directly by including the appropriate dependency.

    To build an application against Dempsy, you will need to add the Dempsy dependencies to your build. This should be as simple as including the following dependency in your Maven pom.xml file (or the Gradle equivalent).

    <dependency>
       <groupId>net.dempsy</groupId>
       <artifactId>lib-dempsyspring</artifactId>
       <version>0.9</version> <!-- Current version -->
    </dependency>
    

    Please review the section on Understanding the Dempsy Codebase for an overview of how the library jar files are organized.

    You can download the source code by cloning the GitHub repository:

    git clone git://github.com/Dempsy/Dempsy.git
    

    or you can download a released version of the source code:

    1. Zip file: Dempsy version 0.9 zip
    2. Tar and gzipped: Dempsy version 0.9 tar.gz

    Current Work

    See the issues list for a backlog. The primary tasks are currently:

    1. Full Elasticity - While Dempsy was built from the ground up to be elastic, the ability for the topology to morph dynamically is available in the current trunk but not yet in a release version.
    2. AWS push button deployment and an example integration with an elastic provisioning tool

    Getting Help

    The team is available through several channels.

    Overview

    What is Dempsy?

    In a nutshell, Dempsy is a framework that provides for the easy implementation of stream-based, real-time, BigData applications.

    Dempsy is Nokia's "Distributed Elastic Message Processing System."

    What Problem is Dempsy solving?

    Dempsy is not designed to be a general purpose framework, but is intended to solve two specific classes of "stream processing" problems while encouraging the use of the best software development practices. These two classes include:

    Statistical Analytics

    Dempsy can be used to solve the problem of processing large amounts of "near real time" stream data with the lowest lag possible. Statistical Analytics are problems where latency is more important than "guaranteed delivery." This class of problems includes use cases such as:

    Deterministic Analytics (Coming in April 2013)

    In the case where guaranteed delivery is more important than lag reduction, Dempsy provides for the ability to run analytics where stream messages are processed with at-least-once semantics:

    Dempsy is meant to give developers a tool for solving these problems in a simple, straightforward manner by letting them concentrate on the analytics themselves rather than the infrastructure. Dempsy heavily emphasizes "separation of concerns" through "dependency injection" and supports both Spring and Guice out of the box. It does all of this by supporting what can (almost) be described as a "distributed actor model."

    In short, Dempsy is a framework that enables decomposing a large class of message processing applications into flows of messages to relatively simple processing units implemented as POJOs.

    Features

    Important features of Dempsy include:

    Dempsy is intentionally not an “Application Server” and runs in a completely distributed manner, relying on Apache ZooKeeper for coordination. In keeping with one of the development team's guiding principles, it doesn’t try to solve problems that are already well solved in other parts of the industry. As DevOps tools become the norm in cloud computing environments, where it’s “easier to re-provision than to repair,” Dempsy defers the management of computational resources to such systems. Being fully elastic (May 2012), however, it cooperates with them to produce true dynamic fault tolerance.

    What is a Distributed Actor Framework?

    Dempsy has been described as a distributed actor framework. It is not, strictly speaking, an actor framework in the sense of Erlang or Akka, where actors typically send messages directly to other actors. Instead, the Message Processors in Dempsy are "actor-like POJOs," similar to Processor Elements in S4 and, to a lesser extent, Bolts in Storm. Message Processors resemble actors in that they act on a single message at a time and need not deal with concurrency directly. Unlike actors, Message Processors are also relieved of the need to know the destination(s) for their output messages, as this is handled inside Dempsy itself.

    The Actor model is an approach to concurrent programming that has the following features:

    A traditional (linear) programming model processes input sequentially, maintaining whatever state is needed to represent the entire input space. In a "Fine Grained Actor" model, input is divided into messages and distributed to a large number of independent actors. An individual actor maintains only the state needed to process the messages that it receives.

    Each actor maintains its own state, and does not expose that state to any other actor. This eliminates concurrency bottlenecks and the potential for deadlocks. Immutable state may be shared between actors.

    Actors communicate by sending immutable messages to one another. Each message has a key, and the framework is responsible for directing the message to the actor responsible for that key.
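The three properties above (per-actor state, no shared mutable state, key-based delivery) can be illustrated with a small, single-JVM sketch. It does not use Dempsy's API: `KeyedRouterSketch`, `Message`, `SumActor`, and `Router` are hypothetical names invented for this illustration, and the `Router` merely stands in for whatever the framework does internally.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; none of these names come from Dempsy. */
final class KeyedRouterSketch {

    /** An immutable message: a key plus a payload. */
    static final class Message {
        final String key;
        final int value;

        Message(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    /** An "actor" that holds only the state for the single key it owns. */
    static final class SumActor {
        private long total; // private state, never shared with other actors

        void handle(Message message) {
            total += message.value;
        }

        long total() {
            return total;
        }
    }

    /** Stand-in for the framework: creates one actor per key and routes to it. */
    static final class Router {
        private final Map<String, SumActor> actors = new HashMap<>();

        void dispatch(Message message) {
            // The sender never names a destination; the key alone selects the actor.
            actors.computeIfAbsent(message.key, k -> new SumActor()).handle(message);
        }

        long totalFor(String key) {
            SumActor actor = actors.get(key);
            return actor == null ? 0L : actor.total();
        }

        int actorCount() {
            return actors.size();
        }
    }

    public static void main(String[] args) {
        Router router = new Router();
        router.dispatch(new Message("a", 1));
        router.dispatch(new Message("b", 5));
        router.dispatch(new Message("a", 2));
        System.out.println(router.totalFor("a")); // prints 3
        System.out.println(router.totalFor("b")); // prints 5
    }
}
```

Because each actor sees only the messages for its own key, the handler needs no locking of its own in this sketch; a real framework would also have to guarantee that a given actor is never invoked concurrently.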

    A distributed actor model takes an additional step, of allowing actors to exist on multiple nodes in a cluster, and supporting communication of messages between nodes. It adds the following complexities to the Actor model:

    A message may or may not be consumed by an actor residing in the same JVM as the actor that sent the message. The required network communication will add delay to processing, and require physical network configuration to support bandwidth requirements and minimize impact to other consumers.

    The framework must distribute work evenly between nodes, potentially using different strategies for different message types (e.g., regional grouping for map-matcher, simple round-robin for vehicles).
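As a rough sketch of what "different strategies for different message types" could look like, the following contrasts a key-hash strategy (all messages with the same key land on the same node, which is one way to get grouping) with a round-robin strategy. The `RoutingStrategy` interface and both implementations are assumptions made for this example and are not Dempsy's routing abstractions.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical illustration only; not Dempsy's routing API. */
final class DistributionSketch {

    /** Picks the node that should receive a message with the given key. */
    interface RoutingStrategy {
        String selectNode(String messageKey, List<String> nodes);
    }

    /** Same key always maps to the same node while the node list is unchanged. */
    static final class KeyHashStrategy implements RoutingStrategy {
        @Override
        public String selectNode(String messageKey, List<String> nodes) {
            // floorMod keeps the index non-negative even for negative hash codes.
            return nodes.get(Math.floorMod(messageKey.hashCode(), nodes.size()));
        }
    }

    /** Ignores the key and cycles through the nodes in order. */
    static final class RoundRobinStrategy implements RoutingStrategy {
        private final AtomicInteger next = new AtomicInteger();

        @Override
        public String selectNode(String messageKey, List<String> nodes) {
            return nodes.get(Math.floorMod(next.getAndIncrement(), nodes.size()));
        }
    }

    public static void main(String[] args) {
        List<String> nodes = Arrays.asList("n1", "n2", "n3");
        RoutingStrategy roundRobin = new RoundRobinStrategy();
        for (int i = 0; i < 4; i++) {
            System.out.println(roundRobin.selectNode("vehicle-" + i, nodes));
        }
        RoutingStrategy byKey = new KeyHashStrategy();
        System.out.println(byKey.selectNode("region-7", nodes));
    }
}
```

Round-robin spreads load evenly but scatters a key's messages across nodes, so it only suits processing that keeps no per-key state; key hashing preserves per-key state at the cost of uneven load when keys are skewed.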

    If a node fails, the workload on that node must be shifted to other nodes. All state maintained by actors on the failed node is presumed lost.

    If the network connection to a node temporarily drops, it will appear as a node failure to other nodes in the cluster. The node must itself recognize that it is no longer part of the cluster, and its actors must stop sending messages (which may conflict with those sent by the cluster's "replacement" node).

    To support elastic scalability (adding nodes on demand to service load, as well as re-integration of a previously failed node), the framework must support redistribution of actors and their state based on changes to the cluster.
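The failure and elasticity requirements above both come down to recomputing who owns which slice of the key space whenever cluster membership changes. The sketch below shows one simple way to do that, assuming a fixed number of shards and a modulo assignment over the sorted list of live nodes. This is not a description of how Dempsy redistributes work; `RebalanceSketch`, `assign`, and `shardFor` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not Dempsy's redistribution logic. */
final class RebalanceSketch {

    /** Maps a message key to one of a fixed number of shards. */
    static int shardFor(String key, int shardCount) {
        return Math.floorMod(key.hashCode(), shardCount);
    }

    /** Assigns every shard to an owner chosen from the currently live nodes. */
    static Map<Integer, String> assign(int shardCount, Collection<String> liveNodes) {
        if (liveNodes.isEmpty()) {
            throw new IllegalArgumentException("no live nodes");
        }
        List<String> sorted = new ArrayList<>(liveNodes);
        Collections.sort(sorted); // every node derives the same ordering
        Map<Integer, String> owners = new TreeMap<>();
        for (int shard = 0; shard < shardCount; shard++) {
            owners.put(shard, sorted.get(shard % sorted.size()));
        }
        return owners;
    }

    public static void main(String[] args) {
        // Two nodes share four shards.
        System.out.println(assign(4, Arrays.asList("n1", "n2")));
        // n2 fails: its shards move to n1, and actor state held on n2 is lost.
        System.out.println(assign(4, Arrays.asList("n1")));
        // A new node n3 joins alongside a recovered n2: shards are spread again.
        System.out.println(assign(4, Arrays.asList("n1", "n2", "n3")));
    }
}
```

A modulo assignment is easy to reason about but moves more shards than necessary when membership changes; schemes such as consistent hashing reduce that movement at the cost of extra complexity.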

    Guiding philosophy

    Above all, and in many ways, Dempsy is meant to be SIMPLE. It doesn't try to be the solution for every problem. It tries to do one thing well, and it is meant to support developers who think this way. Dempsy is built on, and built to encourage, several interrelated principles. These principles are meant to reduce the longer-term total cost of ownership of the software written using Dempsy. These include:

    What about other options?

    Complex Event Processing systems (CEP)

    CEP is really trying to solve a different problem. If you have a large stream of data you want to mine by separating it into subsets and executing different analytics on each subset (which can include ignoring entire subsets), then CEP solutions make sense. If, however, you’re going to do the same thing to every message, then you will be underutilizing the power of CEP. Underutilized functionality usually means an increased total cost of ownership, and Dempsy is ALL ABOUT reducing the total cost of ownership for systems that do this type of processing.

    Pure Actor Model Frameworks and Languages

    There are several pure "Actor Model" frameworks and languages that have been proposed as alternatives to Dempsy. Dempsy is not a pure actor model and primarily solves a different problem. As described above, Dempsy is primarily a routing mechanism that delivers messages to "fine grained" actors. The reason we still (though loosely) call it an "actor model" is that Dempsy supports concurrency the way a typical Actor Model does.

    Other Stream Processors

    Dempsy emphasizes reducing the total cost of ownership of real-time analytics applications and as a direct result we feel it has some advantages over the alternatives.

    First, as mentioned, Dempsy supports “fine grained” message processing. When we write the same use cases in parallel, in Dempsy and in alternatives that don't support this programming model, we find that the Dempsy versions have a lower line count.

    Also, because of Dempsy’s emphasis on “Inversion of Control” the resulting applications are more easily testable. With the exception of annotations, Message Processors, which are the atomic unit of work in Dempsy, have no dependency on the framework itself. Every alternative we've found requires that the application be written against and written to use that framework.
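To make the testability point concrete, here is a minimal sketch of a word-counting processor written as a plain object and exercised without any framework running. The class and method names (`WordCountSketch`, `WordCounter`, `handle`) are hypothetical, and the framework annotations that a real Dempsy Message Processor would carry are deliberately left out so the example compiles on its own.

```java
/** Hypothetical illustration only; annotations a real Message Processor needs are omitted. */
final class WordCountSketch {

    /** One instance would be responsible for a single word (its message key). */
    static final class WordCounter {
        private long count; // per-instance state; no framework types involved

        /** Handles one message and returns the running count as its output. */
        long handle(String word) {
            count++;
            return count;
        }
    }

    public static void main(String[] args) {
        // The "test": construct the POJO directly and call it like any other object.
        WordCounter counter = new WordCounter();
        counter.handle("dempsy");
        counter.handle("dempsy");
        long result = counter.handle("dempsy");
        if (result != 3L) {
            throw new AssertionError("expected 3 but got " + result);
        }
        System.out.println("count = " + result); // prints count = 3
    }
}
```

Nothing here needs a cluster, a transport, or a container to run, which is the property the paragraph above attributes to Inversion of Control.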

    Also, in following the adage to never require the system to be told something that it can deduce, the topology of a Dempsy application’s pipeline is discovered at runtime and doesn’t need to be preconfigured. This is primarily a by-product of the fact that Dempsy was designed from the ground up to be “elastic” and as a result, the topology can morph dynamically.

    This means that applications with complicated topologies with many branches and merges can be trivially configured since the dependency relationship between stages is discovered by the framework.
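The idea that a framework can deduce how stages connect, instead of being told, can be sketched with plain Java reflection: if one stage's handler returns a type that another stage's handler accepts, there is an edge between them. This is only an illustration of the principle under an invented convention (a single one-argument method named `handle` per stage); it is not how Dempsy performs discovery, and every name below is hypothetical.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; not Dempsy's discovery mechanism. */
final class TopologySketch {

    static final class Word {
        final String text;
        Word(String text) { this.text = text; }
    }

    static final class Count {
        final String word;
        final long n;
        Count(String word, long n) { this.word = word; this.n = n; }
    }

    // Three stages that know nothing about each other.
    static final class Splitter {
        public Word handle(String line) { return new Word(line); }
    }

    static final class Counter {
        public Count handle(Word word) { return new Count(word.text, 1L); }
    }

    static final class Printer {
        public void handle(Count count) { /* would print or store the count */ }
    }

    /** Finds the one-argument method named "handle" (a convention of this sketch). */
    static Method handler(Class<?> stage) {
        for (Method method : stage.getDeclaredMethods()) {
            if (method.getName().equals("handle") && method.getParameterCount() == 1) {
                return method;
            }
        }
        throw new IllegalArgumentException("no handler on " + stage.getName());
    }

    /** Derives "from -> to" edges by matching output types to input types. */
    static List<String> discoverEdges(List<Class<?>> stages) {
        List<String> edges = new ArrayList<>();
        for (Class<?> from : stages) {
            Class<?> output = handler(from).getReturnType();
            if (output == void.class) {
                continue; // a terminal stage emits nothing
            }
            for (Class<?> to : stages) {
                Class<?> input = handler(to).getParameterTypes()[0];
                if (input.isAssignableFrom(output)) {
                    edges.add(from.getSimpleName() + " -> " + to.getSimpleName());
                }
            }
        }
        return edges;
    }

    public static void main(String[] args) {
        List<Class<?>> stages = Arrays.asList(Splitter.class, Counter.class, Printer.class);
        for (String edge : discoverEdges(stages)) {
            System.out.println(edge);
        }
    }
}
```

Adding a second consumer of `Count` would create a branch without touching any configuration, which is the kind of trivially configured branching and merging the paragraph above describes.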

    Now that you've decided to give it a try