
Streaming data to S3 using Akka Streams


Slides for a lightning talk at the Reactive Systems meetup at Nitro, Dublin


  1. Streaming data to S3 using akka-streams. Mikhail Girkin, Software Engineer, GILT / HBC Digital, @mike_girkin
  2. The problem ● Several big (hundreds of MB) database result sets ● Served as JSON files ● The service was constantly OOM-ing, even on a 32 GB instance
  3. Akka-streams ● A library from the Akka toolbox ● Built on top of the actor framework ● Handles streams and their specifics without exposing the actors themselves
  4. A bit on akka-streams: Source ● The entry point of data into the stream ● Has an output channel that feeds data into the stream (diagram: SQLSource)
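A minimal sketch of building Sources in code (in-memory data, purely for illustration; the SQL-backed source from the slide's diagram comes later in the deck):

    import akka.NotUsed
    import akka.stream.scaladsl.Source

    // A finite Source that emits the integers 1..100 and then completes
    val numbers: Source[Int, NotUsed] = Source(1 to 100)

    // A Source backed by an iterator, useful for lazily produced data
    val naturals: Source[Int, NotUsed] = Source.fromIterator(() => Iterator.from(1))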
  5. A bit on akka-streams: Sink ● The terminal point of data in the stream ● Has an input channel that receives data from the stream (diagram: S3 object)
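A minimal sketch of two common Sinks (the names here are illustrative, not the talk's code):

    import akka.Done
    import akka.stream.scaladsl.Sink
    import scala.concurrent.Future

    // Consumes every element; materializes a Future[Done] that completes with the stream
    val printSink: Sink[Int, Future[Done]] = Sink.foreach[Int](println)

    // Accumulates a single result, e.g. counting the bytes that passed through
    val countBytes: Sink[Byte, Future[Long]] = Sink.fold[Long, Byte](0L)((acc, _) => acc + 1)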
  6. Another bit on akka-streams: Flow ● The transformation stage of the stream ● Takes data from its input, applies some computation to it, and passes the resulting data to its output (diagram: Serialization)
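A minimal sketch of a Flow, assuming a placeholder Item type and serializer (the real ones appear on slide 10):

    import akka.NotUsed
    import akka.stream.scaladsl.Flow

    // Placeholder item type and serializer, standing in for the real ones
    final case class Item(id: Long, name: String)
    def serializeItem(item: Item): String = s"""{"id":${item.id},"name":"${item.name}"}"""

    // A Flow transforms each element as it passes through: Item in, JSON String out
    val serialize: Flow[Item, String, NotUsed] = Flow[Item].map(serializeItem)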
  7. Basic stream operations ● via: Source via Flow => Source; Flow via Flow => Flow ● to: Flow to Sink => Sink; Source to Sink => RunnableGraph
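The same composition rules in code (a standalone sketch, not the talk's example):

    import akka.NotUsed
    import akka.stream.scaladsl.{Flow, Sink, Source}

    val source    = Source(1 to 100)
    val addTen    = Flow[Int].map(_ + 10)
    val toText    = Flow[Int].map(_.toString)
    val printSink = Sink.foreach[String](println)

    val src2: Source[Int, NotUsed]          = source.via(addTen)   // Source via Flow => Source
    val flow2: Flow[Int, String, NotUsed]   = addTen.via(toText)   // Flow via Flow => Flow
    val sink2: Sink[Int, NotUsed]           = toText.to(printSink) // Flow to Sink => Sink
    val graph                               = src2.via(toText).to(printSink) // Source to Sink => RunnableGraph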
  8. Declaration is not execution! Stream description is just a declaration, so:
       val s = Source[Int](Range(1, 100).toList)
         .via(Flow[Int].map(x => x + 10))
         .to(Sink.foreach(println))
     will not execute until you call s.run()
  9. The skeleton: get data -> serialize -> send to S3
       def run(): Future[Long] = {
         val cn = getConnection()
         val stream = (cn: Connection) =>
           dataSource.streamList(cn)              // Source[Item] - get data from the DB
             .via(serializeFlow)                  // Flow[Item, Byte] - serialize
             .toMat(s3UploaderSink)(Keep.right)   // Sink[Byte] - upload to S3
         val countFuture = stream(cn).run()
         countFuture.onComplete { r => cn.close() }
         countFuture
       }
  10. Serialize in the stream ● We deal with a single collection ● All items have the same type
       val serializeFlow = Flow[Item]
         .map(x => serializeItem(x))                           // serializeItem: Item => String
         .intersperse("[", ",", "]")                           // sort of mkString for streams
         .mapConcat[Byte] { x => x.getBytes().toIndexedSeq }
  11. S3 multipart upload API ● Allows uploading files in separate chunks ● Allows uploading chunks in parallel ● Uploaded chunks have no TTL (by default) Simplified methods: 1. initialize(bucket, filename) => uploadId 2. uploadChunk(uploadId, partNumber, content) => hashSum 3. complete()
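A hedged sketch of what such a simplified uploader could look like on top of the AWS SDK for Java (v1); the talk does not show this code, so the class shape and names are assumptions, only the SDK calls are real:

    import java.io.ByteArrayInputStream
    import scala.collection.JavaConverters._
    import com.amazonaws.services.s3.AmazonS3
    import com.amazonaws.services.s3.model._

    // Illustrative wrapper around the AWS SDK v1 multipart-upload calls
    class MultipartUploader(s3: AmazonS3, bucket: String, key: String) {
      private val uploadId: String =
        s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key)).getUploadId
      private var etags: List[PartETag] = Nil

      def uploadChunk(partNumber: Int, content: Array[Byte]): MultipartUploader = {
        val request = new UploadPartRequest()
          .withBucketName(bucket)
          .withKey(key)
          .withUploadId(uploadId)
          .withPartNumber(partNumber)
          .withInputStream(new ByteArrayInputStream(content))
          .withPartSize(content.length.toLong)
        etags = s3.uploadPart(request).getPartETag :: etags
        this                                   // return the uploader so it can be folded over
      }

      def complete(): CompleteMultipartUploadResult =
        s3.completeMultipartUpload(
          new CompleteMultipartUploadRequest(bucket, key, uploadId, etags.reverse.asJava))
    }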
  12. Let's create an S3 Sink! ● SinkA = Flow to SinkB (diagram: S3 upload flow + Sink.head (first value received) = S3 upload sink)
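The pattern in code: attaching a Flow to a Sink yields a new Sink, and Keep.right keeps Sink.head's materialized value, a Future of the single element that reaches it. A simple numeric sketch, unrelated to the talk's data:

    import akka.stream.scaladsl.{Flow, Keep, Sink}
    import scala.concurrent.Future

    // Reduce the whole stream to one value, then expose it as the Sink's materialized Future
    val sumSink: Sink[Int, Future[Int]] =
      Flow[Int]
        .fold(0)(_ + _)
        .toMat(Sink.head)(Keep.right)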
  13. S3 upload sink
       Flow[Byte]
         .grouped(chunkSize)                                   // split the stream into chunks
         .zip(Source.fromIterator(() => Iterator.from(1)))     // give the chunks numbers
         .fold[MultipartUploader](                             // fold over uploader state
           initUploader()                                      // initial value - uploader
         ) { case (uploader, (data, chunkNumber)) =>           // reduce - returns uploader (!)
           uploader.uploadChunk(chunkNumber, data.toArray)
         }
         .map { uploader => uploader.complete() }              // close the uploader on completion
         .toMat(Sink.head)(Keep.right)                         // keep Sink.head's Future as the materialized value
  14. SQL Source Anorm provides an akka-streams SQL source
       libraryDependencies ++= Seq(
         "com.typesafe.play" %% "anorm-akka" % "version",
         "com.typesafe.akka" %% "akka-stream" % "version")

       AkkaStream.source(SQL"SELECT * FROM Test", SqlParser.scalar[String], ColumnAliaser.empty): Source[String, _]
     Brings in minimal transitive dependencies (!)
  15. Road to production ● Retries in case of S3 errors/failures ● Handle possible problems during stream execution (e.g. a failure talking to the DB); a retry sketch follows below
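One possible approach to the retry point above (an assumption, not the talk's code): wrap the flaky source in RestartSource with exponential backoff. The three-parameter signature shown here is the pre-2.6.10 one; newer Akka versions take a RestartSettings value instead:

    import scala.concurrent.duration._
    import akka.NotUsed
    import akka.stream.scaladsl.{RestartSource, Source}

    // Re-create and re-run the wrapped source whenever it fails, backing off between attempts
    def resilient[T](mkSource: () => Source[T, _]): Source[T, NotUsed] =
      RestartSource.onFailuresWithBackoff(
        minBackoff = 1.second,
        maxBackoff = 30.seconds,
        randomFactor = 0.2
      )(mkSource)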
  16. 200 OK
