Developing a Real-time Engine with Akka, Cassandra, and Spray

Jacob Park

What is Paytm Labs and Paytm?
• Paytm Labs is a data-driven lab that tackles hard problems
in fraud, recommendations, ratings, and platforms for
Paytm.
• Paytm is the world's fastest growing mobile-first
marketplace and payment ecosystem that serves over 100
million people who make over 1.5 million business
transactions representing $1.7 billion of goods and
services exchanged annually.

What is Akka?
• Akka (http://akka.io/):
• “Akka is a toolkit and runtime for building highly
concurrent, distributed, and resilient message-driven
applications on the JVM.”
• Packages: “akka-actor”, “akka-remote”, “akka-cluster”,
“akka-persistence”, “akka-http”, and “akka-stream”.

What is Cassandra?
• Cassandra (http://cassandra.apache.org/):
• “The Apache Cassandra database is the right choice
when you need scalability and high availability without
compromising performance.”

What is Spray?
• Spray (http://spray.io/):
• “Spray is an open-source toolkit for building REST/HTTP-
based integration layers on top of Scala and Akka.”
• Packages: “spray-caching”, “spray-can”, “spray-http”,
“spray-httpx”, “spray-io”, “spray-json”, “spray-routing”,
“spray-servlet”.

What is Maquette?
• A real-time fraud rule engine that core operational
platforms call synchronously to evaluate fraud.
• Its core technologies include Akka, Cassandra, and Spray.

Why Akka, Cassandra, and Spray?
• Akka, Cassandra, and Spray are highly performant and
developer-friendly, treat failure as a first-class concept,
and provide strong clustering support. Together these
properties give a Reactive System its responsiveness,
resiliency, and elasticity.

Maquette In a Nutshell
• Three layers: HTTP → Environment → Executor.

HTTP Layer
• Utilize Spray-Can for a fast HTTP endpoint.
• Utilize Jackson for JSON deserialization/serialization.
• Utilize a separate dispatcher for the Bulkhead Pattern.
• Expose a normalized yet flexible schema for integration.
• Request Handling: Worst → Best
• Cameo Pattern (Per-request Actor),
• Ask Pattern (Future),
• RequestHandlerPool (Akka Router Pool).
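
The bulkhead idea in the list above can be sketched without Akka: give request handling and slower backend work their own thread pools, so a saturated backend pool cannot starve the HTTP side. This is only a minimal sketch using plain Scala Futures. The names `Bulkheads`, `BulkheadSketch`, `evaluate`, and `handle` and the pool sizes are illustrative assumptions, not Maquette code; in an Akka application the same isolation would come from a dedicated dispatcher.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical holder for two isolated thread pools (the "bulkheads").
final class Bulkheads(httpThreads: Int, backendThreads: Int) {
  private val httpExecutor = Executors.newFixedThreadPool(httpThreads)
  private val backendExecutor = Executors.newFixedThreadPool(backendThreads)
  val httpPool: ExecutionContext = ExecutionContext.fromExecutor(httpExecutor)
  val backendPool: ExecutionContext = ExecutionContext.fromExecutor(backendExecutor)

  def shutdown(): Unit = {
    httpExecutor.shutdown()
    backendExecutor.shutdown()
  }
}

object BulkheadSketch {
  // Hypothetical stand-in for a fraud evaluation that calls a slow backend.
  def evaluate(payload: String): String = s"evaluated:$payload"

  // Light request work runs on the HTTP pool; the evaluation runs on the backend pool.
  def handle(payload: String, bulkheads: Bulkheads): Future[String] =
    Future(payload.trim)(bulkheads.httpPool)
      .flatMap(p => Future(evaluate(p))(bulkheads.backendPool))(bulkheads.httpPool)

  // Runs one request end to end and releases the pools afterwards.
  def demo(): String = {
    val bulkheads = new Bulkheads(httpThreads = 2, backendThreads = 4)
    try Await.result(handle(" txn-1 ", bulkheads), 5.seconds)
    finally bulkheads.shutdown()
  }
}
```

If every backend thread is blocked, `Future(payload.trim)(bulkheads.httpPool)` still gets scheduled, which is the property the pattern is after.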

HTTP Layer
trait FraudRoute extends BaseRoute with ActorLogging {
  this: Actor =>

  import SprayJacksonSupportUtils._

  override protected def receiveRequest(
    delegateActorRef: ActorRef, parentUriPath: Path
  ): Actor.Receive = {
    // Match POST requests under the parent path and delegate their entity.
    case incomingHttpRequest @ HttpRequest(
          HttpMethods.POST, requestUri, requestHeaders, requestEntity, requestProtocol
        ) if requestUri.path startsWith parentUriPath =>
      // Capture the sender up front so the reply target stays stable.
      val senderActorRef = sender()
      unmarshalHttpEntityAndDelegateRequest(
        requestEntity, delegateActorRef, senderActorRef
      )
  }
}

Environment Layer
• A tree of actors which are responsible for managing a
cache or pool of Contexts and Dependencies required to
evaluate incoming requests.
• A Context is a Document Message which wraps
configurations for evaluating requests.
• A Dependency is a Document Message which wraps
optimized queries to Cassandra.

Environment Layer
• Map incoming requests to a Context by forking a template
with .copy().
• Forward the forked Context to the Executor Layer in the
same or a different JVM with an Akka Router.
• Consider implementing a custom router to favour locality
of execution on the same JVM until responsiveness
requires distribution.

Environment Layer
• Always pre-compute and pre-optimize the Environment
Layer as a whole.
• Allow Contexts to be pre-computed and updated
remotely.
• Ensure Contexts and Dependencies are designed for
optimization by allowing arithmetic reduction or sorts.
• Prefer pairing each EnvironmentActor with a ProxyActor
and a StateActor, so the whole environment stays cached
and recovery from failures is fast.

Environment Layer
type EnvironmentStateActorRefFactory =
  (EnvironmentProxyActorContext, EnvironmentProxyActorSelf) => ActorRef
type EnvironmentActorRefFactory =
  (EnvironmentProxyActorContext, EnvironmentProxyActorSelf) => ActorRef

class EnvironmentProxyActor(
  environmentStateActorRefFactory: EnvironmentStateActorRefFactory,
  environmentActorRefFactory: EnvironmentActorRefFactory
) extends Actor with ActorLogging {
  // The proxy creates and owns both children through the injected factories.
  val environmentStateActorRef = environmentStateActorRefFactory(context, self)
  val environmentActorRef = environmentActorRefFactory(context, self)

  override def receive: Receive =
    receiveEnvironmentState orElse
      receiveFraudRequest orElse
      receiveEnvironmentLocalCommand orElse
      receiveEnvironmentRemoteCommand
}

Environment Layer
// Class header and `receive` line were lost in extraction; they are
// reconstructed here by analogy with EnvironmentProxyActor (an assumption).
class EnvironmentStateActor(
  environmentProxyActorRef: ActorRef, databaseInstance: Database
) extends Actor with ActorLogging {
  import EnvironmentStateActor._
  import EnvironmentStateFactory._
  import EnvironmentStateLifecycleStrategy._
  import EnvironmentStateRepository._

  var environmentState: Option[EnvironmentState] = None

  override def receive: Receive =
    receiveLocalCommand orElse
      receiveRemoteCommand

  object EnvironmentStateLifecycleStrategy { ... }
  object EnvironmentStateFactory { ... }
  object EnvironmentStateRepository { ... }
}

Environment Layer
// Class header and `receive` line were lost in extraction; they are
// reconstructed here by analogy with EnvironmentProxyActor (an assumption).
class EnvironmentActor(
  environmentProxyActor: ActorRef, executorActorRef: ActorRef, bootActorRef: ActorRef
) extends Actor with ActorLogging {
  import EnvironmentActor._
  import EnvironmentLifecycleStrategy._

  var environmentState: Option[EnvironmentState] = None

  override def receive: Receive =
    receiveEnvironmentState orElse
      receiveFraudRequest

  // Fork the template Context for this evaluation type; None if state or template is missing.
  def forkedMaquetteContext(fraudRequest: FraudRequest): Option[MaquetteContext] = {
    val forkedMaquetteContextOption = for {
      actualEnvironmentState <- environmentState
      actualBaseMaquetteContext <- actualEnvironmentState.maquetteContextMap.
        get(fraudRequest.evaluationType)
      actualForkMaquetteContext = actualBaseMaquetteContext.
        copy(fraudRequest = fraudRequest)
    } yield actualForkMaquetteContext
    forkedMaquetteContextOption
  }
}

Executor Layer
• A pipeline of actors responsible for scheduling execution of
Tasks defined within a Context with the specified
Dependencies, executing the Tasks, and coordinating the
results of the Tasks to provide a response.
• A Task is an optimized set of executable rules.

Executor Layer
• Ideally, the Executor Layer should be stateless to allow
easy recovery from failures.
• Ideally, keep the Executor Layer available across the
cluster.

Executor Layer
type ExecutorRouterActorRefFactory =
  (ExecutorActorContext, ExecutorActorSelf) => ActorRef
type ExecutorCoordinatorActorRefFactory =
  (ExecutorActorContext, ExecutorActorSender, ExecutorActorNext, MaquetteContext, Timeout) =>
    ActorRef

// Class header and `receive` line were lost in extraction; they are
// reconstructed here by analogy with EnvironmentProxyActor (an assumption).
class ExecutorActor(
  executorRouterActorRefFactory: ExecutorRouterActorRefFactory,
  executorCoordinatorActorRefFactory: ExecutorCoordinatorActorRefFactory,
  actionActorRef: ActorRef
) extends Actor with ActorLogging {
  import ExecutorActor._
  import ExecutorSchedulerStrategy._

  val executorRouterActorRef: ActorRef = executorRouterActorRefFactory(context, self)

  override def receive: Receive =
    receiveMaquetteContext orElse
      receiveMaquetteResult

  object ExecutorSchedulerStrategy {
    def scheduleExecution(maquetteContext: MaquetteContext): Unit = { ... }
  }
}

Executor Layer
• Design a Task as a functional and monadic data structure.
• Written functionally, a Task isolates side effects from
pure functions.
• As a Monad, a Task composes and reduces cleanly, which
makes it easy to optimize and to parallelize heavily.
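
To make the "functional and monadic" idea concrete, here is a minimal sketch of a Task as a pure description of a computation over some input, with `map` and `flatMap` so that Tasks compose in a for-comprehension. The names `Task`, `Txn`, and `TaskSketch` and the two example checks are hypothetical and are not Maquette's actual Task or rules.

```scala
// Hypothetical Task: a pure function from an input A to a result B.
final case class Task[A, B](run: A => B) {
  // Transform the result without running anything yet.
  def map[C](f: B => C): Task[A, C] =
    Task[A, C](a => f(run(a)))

  // Sequence a second Task that depends on the first result, over the same input.
  def flatMap[C](f: B => Task[A, C]): Task[A, C] =
    Task[A, C](a => f(run(a)).run(a))
}

// Hypothetical input type for illustration only.
final case class Txn(amount: Int, attempts: Int)

object TaskSketch {
  val largeAmount: Task[Txn, Boolean] = Task((t: Txn) => t.amount > 1000)
  val manyAttempts: Task[Txn, Boolean] = Task((t: Txn) => t.attempts > 3)

  // Composition is plain data until `run` is called, so no side effects leak out.
  val suspicious: Task[Txn, Boolean] = for {
    a <- largeAmount
    b <- manyAttempts
  } yield a && b
}
```

Because each Task is a value with no hidden state, two Tasks over the same input can be evaluated in any order or in parallel and then combined.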

Executor Layer
case class Query(
  selectComponent: Select, fromComponent: From, whereComponent: Where
) {
  // Merge: select the union of both queries' columns.
  def + (that: Query): Query = {
    this.copy(selectComponent =
      Select(this.selectComponent.columnNames union
        that.selectComponent.columnNames)
    )
  }
  // Subtract: drop the columns the other query selects.
  def - (that: Query): Query = {
    this.copy(selectComponent =
      Select(this.selectComponent.columnNames diff
        that.selectComponent.columnNames)
    )
  }
}
Note: An example of a Rule object is not shown as it is a trade secret.
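
As a usage sketch for the `+` and `-` operators, the self-contained version below merges two queries against the same table so that one read can serve both. `Select`, `From`, and `Where` are not shown in the slides, so their shapes here (a set of column names, a table name, a clause string) are assumptions, as are the table and column names.

```scala
// Assumed shapes for the components; the slides do not show them.
final case class Select(columnNames: Set[String])
final case class From(tableName: String)
final case class Where(clause: String)

final case class Query(
  selectComponent: Select, fromComponent: From, whereComponent: Where
) {
  // Union of selected columns; FROM and WHERE are taken from the left operand.
  def +(that: Query): Query =
    copy(selectComponent =
      Select(selectComponent.columnNames.union(that.selectComponent.columnNames)))

  // Columns selected by this query but not by the other.
  def -(that: Query): Query =
    copy(selectComponent =
      Select(selectComponent.columnNames.diff(that.selectComponent.columnNames)))
}

object QuerySketch {
  private val byUser = Where("user_id = ?")

  val amounts = Query(Select(Set("amount", "currency")), From("transactions"), byUser)
  val merchants = Query(Select(Set("amount", "merchant")), From("transactions"), byUser)

  // One merged query covers the columns both dependencies need.
  val merged: Query = amounts + merchants
  // What `merchants` adds on top of `amounts`.
  val extra: Query = merged - amounts
}
```

Note that `+` as written only combines the SELECT lists, so it is meaningful only when both queries share the same FROM and WHERE components.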

Executor Layer
• For a Task object, consider an external DSL that is
interpreted into executable, immutable graphs, or even
compiled to Java bytecode.
• Scala Parser Combinators:
https://github.com/scala/scala-parser-combinators
• Parboiled2: https://github.com/sirthias/parboiled2
• ANTLR: http://www.antlr.org/

Executor Layer
object QueryParser extends JavaTokenParsers {
  def parseQuery(queryString: String): Try[Query] = {
    parseAll(queryStatement, queryString) ...
  }

  object QueryGrammar {
    // SELECT ... FROM ... [WHERE ...] ;  (a missing WHERE becomes Where.Empty)
    lazy val queryStatement: Parser[Query] =
      selectClause ~ fromClause ~ opt(whereClause) ~ ";" ^^ {
        case selectComponent ~ fromComponent ~ whereComponent ~ ";" =>
          Query(selectComponent, fromComponent, whereComponent.getOrElse(Where.Empty))
      }
  }

  object SelectGrammar { ... }
  object FromGrammar { ... }
  object WhereGrammar { ... }
  object StaticClauseGrammar { ... }
  object DynamicClauseGrammar { ... }
  object InterpolationTypeGrammar { ... }
  object DataTypeGrammar { ... }
  object LexicalGrammar { ... }
}
Note: An example of a Rule parser is not shown as it is a trade secret.

Abstracting Concurrency for High Parallelism Tasks
• Scala Futures.
• Scala Parallel Collections.
• Akka Router Pool.
• Akka Streams.

Scala Futures
• “A Future is an object holding a value which may become
available at some point.”
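import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

val f = for {
  a <- Future(10 / 2)
  b <- Future(a + 1)
  c <- Future(a - 1)
  if c > 3
} yield b * c
// Prints 24 once the future completes.
f foreach println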

Scala Futures
• Advantages: Efficient, Highly Parallel, Simple Monadic
Abstraction.
• Disadvantages: Lacks Communication, Lacks Low-Level
Concurrency Control, JVM Bound.
• Note: Monadic Futures Enqueue All Operations to ExecutionContext
⇒ Lack of Control over Context-Switching.
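
One practical consequence of the monadic style is that a Future created inside a for-comprehension only starts after the previous step completes. The sketch below contrasts that with starting independent Futures first and combining them afterwards. `FutureSketch` and its method names are illustrative, and the global ExecutionContext is assumed only for brevity.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object FutureSketch {
  // Each step is created inside the comprehension, so b and c run one after the other.
  def sequential(): Future[Int] = for {
    a <- Future(10 / 2)
    b <- Future(a + 1)
    c <- Future(a - 1)
  } yield b * c

  // b and c only depend on a, so both are started before either result is awaited.
  def parallel(): Future[Int] =
    Future(10 / 2).flatMap { a =>
      val fb = Future(a + 1)
      val fc = Future(a - 1)
      for {
        b <- fb
        c <- fc
      } yield b * c
    }

  // Blocks for both results; blocking is for demonstration only.
  def demo(): (Int, Int) =
    (Await.result(sequential(), 5.seconds), Await.result(parallel(), 5.seconds))
}
```

Both versions compute the same value. The difference is scheduling, and it is not visible in the types, which is one reason the slide lists the lack of low-level control as a disadvantage.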

Scala Parallel Collections
• Scala Parallel Collections is a package that allows
collections to execute operations in parallel. It shipped in
the Scala standard library up to 2.12 and has been a
separate module since 2.13.
(0 until 100000).par
  .filter(x => x.toString == x.toString.reverse)

Scala Parallel Collections
• Advantages: Very Efficient, Highly Parallel, Control of
Parallelism Level.
• Disadvantages: Lacks Communication, Non-parallelizable
Operations (e.g. foldLeft(); aggregate() is the parallel
alternative), Non-determinism and Side-Effect Pitfalls
Hidden by the Abstraction, JVM-Bound.
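
The non-determinism pitfall comes from how a parallel reduction works: the collection is split into chunks, each chunk is reduced independently, and the partial results are combined. That only matches the sequential answer when the operator is associative. The sketch below simulates the chunk-and-combine step with ordinary sequential collections, so it needs no parallel-collections dependency. `ChunkedFoldSketch`, `chunkedReduce`, and the chunk size are illustrative assumptions.

```scala
object ChunkedFoldSketch {
  // Reduce each chunk on its own, then combine the partial results.
  // This mimics the shape of a parallel reduction without using threads.
  def chunkedReduce(xs: Seq[Int], chunkSize: Int)(op: (Int, Int) => Int): Int =
    xs.grouped(chunkSize).map(_.reduce(op)).reduce(op)

  val xs: Seq[Int] = 1 to 8

  // Addition is associative: chunking does not change the answer.
  val sumSequential: Int = xs.reduce(_ + _)
  val sumChunked: Int = chunkedReduce(xs, 3)(_ + _)

  // Subtraction is not associative: the chunked answer differs.
  val diffSequential: Int = xs.reduce(_ - _)
  val diffChunked: Int = chunkedReduce(xs, 3)(_ - _)
}
```

With a real parallel collection the chunk boundaries can vary between runs, so a non-associative operator can produce different results each time.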

Akka Router Pool
• An Akka Router Pool maintains a pool of child actors
(routees) and forwards messages to them.
• Configured with an appropriate dispatcher, mailbox,
supervisor, and routing logic, an Akka Router Pool is a
highly parallel yet elastic construct for executing tasks.

Akka Router Pool
// Restart any failed routee.
val routerSupervisionStrategy = OneForOneStrategy() {
  case _ => SupervisorStrategy.Restart
}
// Router type and size come from the deployment configuration.
val routerPool = FromConfig.
  withSupervisorStrategy(routerSupervisionStrategy)
val routerProps = routerPool.props(
  ExecutorWorkerActor.props(accessLayer).
    withDispatcher(DispatcherConfigPath)
)
context.actorOf(
  props = routerProps,
  name = RouterName
)
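
Because the pool is built with `FromConfig`, its routing logic and size are read from the actor deployment configuration, and `withDispatcher` points at a dispatcher defined in the same file. A configuration along the following lines would satisfy both. The deployment path, dispatcher and mailbox names, and every numeric value are assumptions for illustration; the slides do not show Maquette's actual settings.

```hocon
# Hypothetical deployment path: must match the router actor's path (parent + RouterName).
akka.actor.deployment {
  /executor/executor-router {
    router = round-robin-pool
    nr-of-instances = 8
  }
}

# Hypothetical dispatcher that DispatcherConfigPath would refer to.
maquette.executor-dispatcher {
  type = Dispatcher
  executor = "fork-join-executor"
  fork-join-executor {
    parallelism-min = 4
    parallelism-factor = 2.0
    parallelism-max = 32
  }
  # Messages an actor processes before its thread moves on to the next actor.
  throughput = 10
}

# Hypothetical bounded mailbox: a full mailbox pushes back on senders.
maquette.bounded-mailbox {
  mailbox-type = "akka.dispatch.BoundedMailbox"
  mailbox-capacity = 1000
  mailbox-push-timeout-time = 10s
}
```

Keeping these values in configuration rather than code lets the pool size and dispatcher throughput be tuned per environment without recompiling.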

Akka Router Pool
• Advantages:
• Work-Pull Pattern = Rate Limiting.
• Bounded Mailbox = Backpressure.
• SupervisorStrategy = Failure Handling.
• Scheduler = Timeout.
• Router Resizer = Predictive Parallelism & Scaling.
• Dispatcher Throughput = Predictive Context Switching.
• Location Transparency = JVM Unbound.

Akka Router Pool
• Disadvantages:
• Complex optimizations or implementation required.
• Actors with state potentially lead to issues regarding
mutability and lack of idempotence.
• Actors which require communication beyond parent-child
trees lead to potentially complex graphs.

Akka Streams
• “Akka Streams is an implementation of Reactive Streams,
which is a standard for asynchronous stream processing
with non-blocking backpressure.”
implicit val system = ActorSystem("reactive-tweets")
implicit val materializer = ActorMaterializer()

val authors: Source[Author, Unit] =
  tweets
    .filter(_.hashtags.contains(akka))
    .map(_.author)

authors.runWith(Sink.foreach(println))

Akka Streams
• Advantages: Backpressure and Failure as First-class
Concepts, Concurrency Control, Simple Monadic
Abstraction, Graph API, Bi-directional Channels.
• Disadvantages: Too New = Risk for Production.
• Current: JVM-Bound; Potentially: Distributed Streaming.
• Current: No Graph Optimization; Potentially: Macro-based
Optimization.

Maquette Performance
• In a staging environment with 10 Cassandra nodes, 4
Maquette nodes, and one HAProxy instance, Maquette
handled ~40,000 requests per second at a mean response
time of 10 milliseconds while evaluating 50 rules.
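
Some back-of-the-envelope figures follow from those numbers. The sketch below derives per-node throughput, the average number of in-flight requests (Little's law: arrival rate times mean time in the system), and rule evaluations per second. It assumes load is spread evenly over the 4 nodes and that every request evaluates all 50 rules; neither assumption is stated in the slides, and `ThroughputSketch` is an illustrative name.

```scala
object ThroughputSketch {
  // Figures reported for the staging environment.
  val requestsPerSecond = 40000
  val maquetteNodes = 4
  val meanLatencyMillis = 10
  val rulesPerRequest = 50

  // Assumes the load balancer spreads requests evenly across nodes.
  val requestsPerNode: Int = requestsPerSecond / maquetteNodes

  // Little's law: in-flight requests = arrival rate x mean time in system.
  val inFlightRequests: Int = requestsPerSecond * meanLatencyMillis / 1000

  // Assumes every request evaluates all rules.
  val ruleEvaluationsPerSecond: Int = requestsPerSecond * rulesPerRequest
}
```

Under these assumptions each node serves about 10,000 requests per second, roughly 400 requests are in flight across the cluster at any moment, and the cluster performs about 2 million rule evaluations per second.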

Tips
• Investigate Akka Streams for Akka HTTP.
• Investigate CPU usage and memory consumption: YourKit
or VisualVM and Eclipse MAT.
• Utilize Kamon to report real-time metrics to StatsD or a
third-party service like Datadog.
• If implementing a DSL or a complex actor-based graph,
remember to utilize ScalaTest and Akka TestKit properly.
• Utilize Gatling.io for load- and scenario-based testing.

Tips
• We used Cassandra 2.1.6 as our main data store for
Maquette. Operating Cassandra was painful.
• Mastering Apache Cassandra (2nd Edition):
http://www.amazon.com/Mastering-Apache-Cassandra-Second-Edition-ebook/dp/B00VAG2WZO

Tips
• Investigate the Play Framework with Akka Cluster to create
a web application for operations.
• Commands to operate instances in the cluster.
• Commands to configure instances in real-time.
• GUI for data scientists and business analysts to easily
define and configure rules.

Tips
• Utilize Kafka to publish audits, which can be used to
monitor rules through a Logstash, Elasticsearch, and
Kibana flow, and archived in HDFS.
• Consider using Kafka to replay audits as requests, so the
real-time engine can run offline for tuning rules.

Resources
• The Reactive Manifesto:
• http://www.reactivemanifesto.org/
• Reactive Messaging Patterns with the Actor Model:
• http://www.amazon.ca/Reactive-Messaging-Patterns-Actor-Model/dp/0133846830
• Learning Concurrent Programming in Scala:
• http://www.amazon.com/Learning-Concurrent-Programming-Aleksandar-Prokopec/dp/1783281413
• Akka Concurrency:
• http://www.amazon.ca/Akka-Concurrency-Derek-Wyatt/dp/0981531660

Thank you!
Jacob Park
jacob@paytm.com
park.jacob.96@gmail.com
