Streams as Scala Collections
S3 Scala Client with Play Iteratees and Composable Operations
Greg Silin
Platform Engineer
@gregsbriefs
www.github.com/nitro/streamcollections
ScalaDays 2015
Agenda
• Reactive at Nitro
• Smart Documents at Scale
• Motivation for Streaming Collections
• Building Streams with Iteratees
• Streams as Scala Collections
• Applications
• Questions
The Old Way
Create & Prepare on the Desktop
Print Document
Sign Printed Document
Scan Into Computer
Knowledge workers spend 11+ hours a week creating and managing documents
The New Way
Create
Prepare
Sign (Anywhere)
Nitro accelerates the way businesses create, prepare, and sign documents.
Anytime and anywhere.
Smarter Documents for Everyone™
Reactive Systems at Nitro
react to user expectations <- responsive
react to state changes <- message driven
react to variable load <- elastic
react to failure <- resilient
Smart Documents at Scale
multiple pages and formats per document
Smart Documents at Scale
Each action results in a new document version
render, sign, approve, ...
Smart Documents at Scale
documents / second × versions / document × pages / version = billions of objects in S3
Smart Documents at Scale
millions of new document uploads a day
100MM+ document state changes a day, resulting in 10x as many messages
billions of objects in S3
Motivation for Streaming Collections
counting
copying
extracting
cleanup
become non-trivial at scale
Motivation for Streaming Collections
1 percent error margin = 10M objects
That’s money for the business
How?
How do we traverse the data?
Command line tools don’t provide flexibility / scale
Can’t load everything in memory
Need some batched solution
Amazon S3 SDK has a Java key iterator
But we are Scala engineers!
Streaming is a natural fit
We are reactive
Thus asynchronous streams
Can’t over-parallelize
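The constraints listed here (batched, never everything in memory, asynchronous, bounded parallelism) can be sketched with the Scala standard library alone. This is a minimal illustration under stated assumptions, not the S3 SDK or the Nitro client: `Page`, `listPage`, and the in-memory key list are hypothetical stand-ins for a paged key listing, and blocking on each page with `Await` is a simplification to keep the sketch short.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BatchedTraversal {
  // Hypothetical page of a key listing: the keys plus the marker of the
  // next page (None once the listing is exhausted).
  final case class Page(keys: Vector[String], nextMarker: Option[Int])

  // Hypothetical in-memory "bucket" of 25 keys: doc-000 .. doc-024.
  private val allKeys: Vector[String] = Vector.tabulate(25)(i => f"doc-$i%03d")

  // Hypothetical stand-in for one remote "list keys" call.
  def listPage(marker: Int, pageSize: Int): Page = {
    val next = marker + pageSize
    Page(allKeys.slice(marker, next), if (next < allKeys.size) Some(next) else None)
  }

  // Lazily walk the listing; only the current page is held in memory.
  def pages(pageSize: Int): Iterator[Vector[String]] = new Iterator[Vector[String]] {
    private var marker: Option[Int] = Some(0)
    def hasNext: Boolean = marker.isDefined
    def next(): Vector[String] = {
      val page = listPage(marker.get, pageSize)
      marker = page.nextMarker
      page.keys
    }
  }

  // Count keys matching `p`, checking the keys of one page concurrently on
  // a fixed pool of `parallelism` threads, so work stays bounded.
  def countMatching(pageSize: Int, parallelism: Int)(p: String => Boolean): Long = {
    val pool = Executors.newFixedThreadPool(parallelism)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      pages(pageSize).map { keys =>
        val checks = Future.traverse(keys)(k => Future(p(k)))
        // Blocking per page keeps the sketch simple; a reactive client would compose futures.
        Await.result(checks, 10.seconds).count(identity).toLong
      }.sum
    } finally pool.shutdown()
  }
}
```

The page size bounds memory and the pool size bounds parallelism; both are tuning knobs chosen here for illustration only.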
What Streams?
Enter Play Iteratees
Enumerator - Source
Enumeratee - Transformer
Iteratee - Consumer / Sink
Building Streams with Iteratees
Why Play Iteratees?
Most mature technology at the time
Production Experience
Building Streams with Iteratees
Play Iteratees via a counting example
Building Streams with Iteratees
Enumerator = Source
Building Streams with Iteratees
Enumeratee = Transformer
Building Streams with Iteratees
Iteratee = Sink / Reduce
Building Streams with Iteratees
Tying things together...
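The code shown on the preceding slides is not part of this text. As a rough guide to how the three roles fit together, here is a synchronous, standard-library-only model of the iteratee pattern applied to the grouped counting example. It is a sketch under assumptions, not the Play Iteratees API: `Input`, `Sink`, `fold`, `grouped`, `enumerate`, `run`, and `countInBatches` are hypothetical names, and the real library is asynchronous.

```scala
object IterateeModel {
  // Input fed to a consumer: one element, or end of stream.
  sealed trait Input[+E]
  final case class El[E](e: E) extends Input[E]
  case object EOF extends Input[Nothing]

  // Consumer state: finished with a result, or waiting for more input.
  sealed trait Sink[E, A]
  final case class Done[E, A](result: A) extends Sink[E, A]
  final case class Cont[E, A](k: Input[E] => Sink[E, A]) extends Sink[E, A]

  // Sink / reduce: fold every element into an accumulator until EOF.
  def fold[E, A](zero: A)(f: (A, E) => A): Sink[E, A] = Cont[E, A] {
    case El(e) => fold(f(zero, e))(f)
    case EOF   => Done[E, A](zero)
  }

  // Transformer: buffers single elements and feeds batches of `n` downstream.
  def grouped[E, R](n: Int)(inner: Sink[Vector[E], R]): Sink[E, R] = {
    require(n > 0, "batch size must be positive")
    def step(buf: Vector[E], in: Sink[Vector[E], R]): Sink[E, R] = in match {
      case Done(r) => Done[E, R](r)
      case Cont(k) => Cont[E, R] {
        case El(e) if buf.size + 1 == n => step(Vector.empty, k(El(buf :+ e)))
        case El(e)                      => step(buf :+ e, in)
        case EOF =>
          // Flush a partial final batch, then pass EOF downstream.
          val flushed = if (buf.isEmpty) in else k(El(buf))
          flushed match {
            case Cont(k2) => step(Vector.empty, k2(EOF))
            case done     => step(Vector.empty, done)
          }
      }
    }
    step(Vector.empty, inner)
  }

  // Source: pushes every element into the sink for as long as it accepts input.
  def enumerate[E, A](items: Iterator[E])(sink: Sink[E, A]): Sink[E, A] =
    items.foldLeft(sink) {
      case (Cont(k), e) => k(El(e))
      case (done, _)    => done
    }

  // Signal EOF if needed and extract the result.
  def run[E, A](sink: Sink[E, A]): Option[A] = sink match {
    case Done(a) => Some(a)
    case Cont(k) => k(EOF) match {
      case Done(a) => Some(a)
      case _       => None // a sink that ignores EOF never yields a result
    }
  }

  // Tying it together: source -> grouped transformer -> counting sink.
  def countInBatches(keys: Iterator[String], batchSize: Int): Option[Long] = {
    val counter = fold[Vector[String], Long](0L)((total, batch) => total + batch.size)
    run(enumerate(keys)(grouped(batchSize)(counter)))
  }
}
```

Even in this reduced form the EOF and buffering cases need care, which is the kind of bookkeeping a collection-style wrapper can hide.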
Building Streams with Iteratees
Can this be simplified?
Streams as Scala Collections
We are all familiar with Scala collections
map
filter
foreach
grouped
count
Streams as Scala Collections
Can reason about iteratee streams as a collection
Streams as Scala Collections
Can now redo our grouped & count example
Streams as Scala Collections
With the internals hidden, my counting code becomes simple
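One way to picture "internals hidden" is a small facade that exposes `map`, `filter`, `grouped`, `count`, and `foreach` over an asynchronous pull function. The sketch below uses only the standard library; `StreamCollection`, `fromIterator`, and the pull-function representation are assumptions made for illustration and are not taken from the streamcollections repository.

```scala
import scala.concurrent.{ExecutionContext, Future}

// Collection-style facade over an asynchronous pull function.
// `pull` yields Some(element), or None once the stream is exhausted.
final class StreamCollection[A](pull: () => Future[Option[A]])(implicit ec: ExecutionContext) {

  def map[B](f: A => B): StreamCollection[B] =
    new StreamCollection(() => pull().map(_.map(f)))

  def filter(p: A => Boolean): StreamCollection[A] = {
    // Keep pulling until an element passes the predicate or the stream ends.
    def nextMatch(): Future[Option[A]] = pull().flatMap {
      case Some(a) if !p(a) => nextMatch()
      case other            => Future.successful(other)
    }
    new StreamCollection(() => nextMatch())
  }

  def grouped(n: Int): StreamCollection[Vector[A]] = {
    require(n > 0, "group size must be positive")
    // Fill a batch of up to n elements; emit a partial batch at end of stream.
    def fill(acc: Vector[A]): Future[Option[Vector[A]]] =
      if (acc.size == n) Future.successful(Some(acc))
      else pull().flatMap {
        case Some(a) => fill(acc :+ a)
        case None    => Future.successful(if (acc.isEmpty) None else Some(acc))
      }
    new StreamCollection(() => fill(Vector.empty))
  }

  // Terminal operations drain the stream and complete with a result.
  def foldLeft[B](zero: B)(f: (B, A) => B): Future[B] = pull().flatMap {
    case Some(a) => foldLeft(f(zero, a))(f)
    case None    => Future.successful(zero)
  }

  def count: Future[Long] = foldLeft(0L)((n, _) => n + 1)

  def foreach(f: A => Unit): Future[Unit] = foldLeft(())((_, a) => f(a))
}

object StreamCollection {
  // Hypothetical adapter: any iterator (for example a key listing) as a stream.
  def fromIterator[A](it: Iterator[A])(implicit ec: ExecutionContext): StreamCollection[A] =
    new StreamCollection(() => Future(if (it.hasNext) Some(it.next()) else None))
}
```

With such a facade, counting the batches of matching keys reads like ordinary collection code, for example `StreamCollection.fromIterator(keys).filter(_.endsWith(".pdf")).grouped(1000).count`, where `keys` is any iterator of key names.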
Streams as Scala Collections - Examples
Cleaning up files
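The cleanup code from this slide is not reproduced in this text. A rough sketch of the idea, using a plain `Iterator` as a stand-in for the stream: filter keys, batch them, delete each batch. The in-memory `bucket`, `deleteBatch`, and the `tmp/` prefix are hypothetical and stand in for a real store and its delete call.

```scala
import scala.collection.mutable

object CleanupSketch {
  // Hypothetical in-memory stand-in for a bucket: key -> size in bytes.
  val bucket: mutable.Map[String, Long] = mutable.LinkedHashMap(
    "tmp/a.pdf" -> 10L, "docs/b.pdf" -> 20L, "tmp/c.pdf" -> 30L, "tmp/d.pdf" -> 40L
  )

  // Hypothetical batch delete; returns how many keys were actually removed.
  def deleteBatch(keys: Seq[String]): Int = keys.count(k => bucket.remove(k).isDefined)

  // Walk a snapshot of the keys, keep those under `prefix`,
  // and delete them in batches of `batchSize`.
  def cleanup(prefix: String, batchSize: Int): Int =
    bucket.keys.toVector.iterator
      .filter(_.startsWith(prefix))
      .grouped(batchSize)
      .map(deleteBatch)
      .sum
}
```

Taking a snapshot of the keys first avoids mutating the map while iterating over it.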
Streams as Scala Collections - Examples
Extract data by date
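Extraction by date follows the same filter-then-map shape. The sketch below is an assumption about what such a query could look like, again with a plain `Iterator` standing in for the stream; `Entry` and `modifiedBetween` are hypothetical names, and the date range is half-open (`from` inclusive, `until` exclusive).

```scala
import java.time.LocalDate

object ExtractByDate {
  // Hypothetical object summary: key plus last-modified date.
  final case class Entry(key: String, lastModified: LocalDate)

  // Lazily keep the keys modified within [from, until).
  def modifiedBetween(entries: Iterator[Entry], from: LocalDate, until: LocalDate): Iterator[String] =
    entries
      .filter(e => !e.lastModified.isBefore(from) && e.lastModified.isBefore(until))
      .map(_.key)
}
```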
Streams as Scala Collections - Applications
Can extend this model to other data sources
We don’t have to stop at S3
➔ Relational DB
➔ ElasticSearch
➔ HBase / Cassandra
➔ Spark
"Much of my work has come from being lazy." - John Backus
Quoted in the IBM employee magazine Think in 1979 (http://en.wikiquote.org/wiki/John_Backus)
What We Learned
Iteratees are good for traversing large volumes of data
Programming iteratees can get a bit tricky
Scaling ain’t easy
Stream Collections abstraction makes streams simple
Future of Streams as Scala Collections
Continue developing a reactive S3 Client
In use in Nitro Production
Introduce other stream implementations (Akka Streams, etc.)
Open Sourcing
www.github.com/nitro/streamcollections
Contributors:
www.github.com/gregsilin / @gregsbriefs
www.github.com/mkolod / @marekinfo
Are you interested? We welcome collaborators!
San Francisco Scala Days 2015
• Nitro is a Gold sponsor
• Meet us at our community booth
sfscala.org:
• Wed: Scala D’Ehs meetup @ Stock in Trade
• Thu: unconference @ Galvanize
• Thu evening: Spark Notebook & Rapture @ Nitro
• Fri: free Shapeless training @ Nitro
We Are Hiring!
gonitro.com/about/jobs
Questions?
@gregsbriefs
greg.silin@gonitro.com
