Using akka-streams to
access S3 objects
Mikhail Girkin
Software Engineer
GILT
HBC Digital
@mike_girkin
Codez? Codez!
https://github.com/gilt/gfc-aws-s3
Initial problem
● Several big (hundreds of MB) database result sets
● All data cached in memory
● Served as JSON files
● The service was constantly OOM-ing, even on a 32 GB instance
Akka-streams
● Library from the Akka toolbox
● Built on top of the actor framework
● Handles streams and their specifics without exposing the actors themselves
What is a “stream”?
● Sequence of objects
● Has an input
● Has an output
● Defined as a sequence of data transformations
● Could be infinite
● Steps could be executed independently
Stream input - Source
● The input of the data in the stream
● Has the output channel to feed data into the stream
SQLSource
Stream output - Sink
● The final point of the data in the stream
● Has the input channel to receive the data from the stream
S3 object
Processing - Flow
● The transformation procedure of the stream
● Takes data from the input, applies some computation to it, and passes the resulting data to the output
Serialization
Basic stream operations
● via
○ Source via Flow => Source
○ Flow via Flow => Flow
● to
○ Flow to Sink => Sink
○ Source to Sink => Stream
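The four combinations above can be checked against the types the compiler reports. The following is a minimal sketch, assuming Akka 2.6 or later (where an implicit ActorSystem is enough to run a stream); the object and value names are made up for illustration. In Akka's own vocabulary, the "Stream" produced by Source to Sink is a RunnableGraph.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Keep, RunnableGraph, Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object CompositionSketch {
  val numbers: Source[Int, NotUsed] = Source(1 to 5)
  val double: Flow[Int, Int, NotUsed] = Flow[Int].map(_ * 2)
  val render: Flow[Int, String, NotUsed] = Flow[Int].map(n => s"#$n")
  val collect: Sink[String, Future[Seq[String]]] = Sink.seq[String]

  // Source via Flow => Source
  val doubled: Source[Int, NotUsed] = numbers.via(double)
  // Flow via Flow => Flow
  val doubleThenRender: Flow[Int, String, NotUsed] = double.via(render)
  // Flow to Sink => Sink
  val renderAndCollect: Sink[Int, NotUsed] = render.to(collect)
  // Source to Sink => runnable stream; toMat + Keep.right keeps the sink's Future
  val graph: RunnableGraph[Future[Seq[String]]] =
    doubled.via(render).toMat(collect)(Keep.right)

  // Runs the graph once on a throwaway ActorSystem and returns the collected elements.
  def runOnce(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("composition-sketch")
    try Await.result(graph.run(), 5.seconds)
    finally system.terminate()
  }
}
```

Note that `to` keeps the materialized value of the left-hand side, which is why the last definition uses `toMat` with `Keep.right` to get hold of the sink's result.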
Declaration is not execution!
Stream description is just a declaration, so:
    val s = Source[Int](Range(1, 100).toList)
      .via(
        Flow[Int].map(x => x + 10)
      ).to(
        Sink.foreach(println)
      )
will not execute until you call
    s.run()
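One way to observe this is to count how many elements a stage has seen before and after run() is called. This is a small sketch, assuming Akka 2.6 or later; the counter and the object name are illustrative only.

```scala
import java.util.concurrent.atomic.AtomicInteger
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Keep, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object DeclarationSketch {
  // Returns (elements seen before run, elements seen after run, stream output).
  def demo(): (Int, Int, Seq[Int]) = {
    implicit val system: ActorSystem = ActorSystem("declaration-sketch")
    try {
      val processed = new AtomicInteger(0)
      // Declaring the graph builds a description; no element flows yet.
      val graph = Source(1 to 3)
        .map { x => processed.incrementAndGet(); x + 10 }
        .toMat(Sink.seq[Int])(Keep.right)
      val before = processed.get()
      // Only run() materializes the graph and pushes elements through map.
      val results = Await.result(graph.run(), 5.seconds)
      (before, processed.get(), results)
    } finally system.terminate()
  }
}
```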
The skeleton
Get data -> serialize -> send to S3
    def run(): Future[Long] = {
      val cn = getConnection()
      val stream = (cn: Connection) =>
        dataSource.streamList(cn)             // Source[Item] - get data from the DB
          .via(serializeFlow)                 // Flow[Item, Byte] - serialize
          .toMat(s3UploaderSink)(Keep.right)  // Sink[Byte] - upload to S3
      val countFuture = stream(cn).run()
      countFuture.onComplete { r =>
        cn.close()
      }
      countFuture
    }
Serialize in the stream
● We are dealing with a single collection
● All items have the same type
    val serializeFlow = Flow[Item]
      .map(x => serializeItem(x))      // serializeItem: Item => String
      .intersperse("[", ",", "]")      // sort of mkString for the streams
      .mapConcat[Byte] {               // mapConcat ≈ flatMap
        x => x.getBytes("UTF-8").toIndexedSeq  // explicit charset instead of the platform default
      }
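To see what the flow above produces, it can be run over a couple of items and decoded back into a string. This is a sketch assuming Akka 2.6 or later; the Item case class and the hand-written serializeItem are hypothetical stand-ins (a real service would use a JSON library).

```scala
import java.nio.charset.StandardCharsets
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object SerializeSketch {
  final case class Item(id: Int, name: String)

  // Hypothetical serializer, good enough for names without special characters.
  def serializeItem(item: Item): String =
    s"""{"id":${item.id},"name":"${item.name}"}"""

  val serializeFlow: Flow[Item, Byte, NotUsed] = Flow[Item]
    .map(serializeItem)
    .intersperse("[", ",", "]") // wraps the elements into a JSON array
    .mapConcat[Byte](s => s.getBytes(StandardCharsets.UTF_8).toIndexedSeq)

  // Runs the flow over the given items and decodes the emitted bytes.
  def render(items: List[Item]): String = {
    implicit val system: ActorSystem = ActorSystem("serialize-sketch")
    try {
      val bytes = Await.result(
        Source(items).via(serializeFlow).runWith(Sink.seq[Byte]), 5.seconds)
      new String(bytes.toArray, StandardCharsets.UTF_8)
    } finally system.terminate()
  }
}
```

Collecting into a Seq is only for the demonstration; in the upload pipeline the bytes go straight to the sink and are never held in memory as a whole.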
S3 multipart upload API
● Allows uploading a file in separate chunks
● Allows uploading chunks in parallel
● (!) By default, uploaded chunks have no TTL
Simplified API:
1. initialize(bucket, filename) => uploadId
2. uploadChunk(uploadId, partNumber, content) => etag
3. complete(uploadId, List[etag])
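The three calls can be modelled without any AWS dependency to make the protocol concrete. The class below is a hypothetical in-memory stand-in, not the AWS SDK: it only mimics the shape of the simplified API above, including the fact that parts may arrive in any order and are assembled by part number on completion.

```scala
import java.util.UUID
import scala.collection.mutable

// Hypothetical in-memory model of the simplified multipart API.
final class InMemoryMultipartStore {
  private final case class Upload(key: String, parts: mutable.Map[Int, Array[Byte]])
  private val pending = mutable.Map.empty[String, Upload]

  // 1. initialize(bucket, filename) => uploadId
  def initialize(bucket: String, filename: String): String = {
    val uploadId = UUID.randomUUID().toString
    pending(uploadId) = Upload(s"$bucket/$filename", mutable.Map.empty)
    uploadId
  }

  // 2. uploadChunk(uploadId, partNumber, content) => etag (a fake etag here)
  def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String = {
    pending(uploadId).parts(partNumber) = content
    s"etag-$partNumber-${content.length}"
  }

  // 3. complete(uploadId, etags): assembles the parts in part-number order
  def complete(uploadId: String, etags: List[String]): Array[Byte] = {
    val upload = pending.remove(uploadId).getOrElse(sys.error(s"unknown upload $uploadId"))
    require(upload.parts.size == etags.size, "every uploaded part needs an etag")
    upload.parts.toSeq.sortBy(_._1).flatMap { case (_, bytes) => bytes.toSeq }.toArray
  }

  // Uploads that were started but never completed keep their chunks around.
  def pendingUploads: Int = pending.size
}

object MultipartDemo {
  def roundTrip(): String = {
    val store = new InMemoryMultipartStore
    val id = store.initialize("my-bucket", "report.json")
    // Parts are uploaded out of order on purpose.
    val etag2 = store.uploadChunk(id, 2, "world".getBytes("UTF-8"))
    val etag1 = store.uploadChunk(id, 1, "hello ".getBytes("UTF-8"))
    new String(store.complete(id, List(etag1, etag2)), "UTF-8")
  }
}
```

The `pendingUploads` counter illustrates the warning in the list above: an upload that is initialized but never completed leaves its chunks behind.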
Resource access
● Pattern: Open - Do stuff - Close
    open: () => TState
    onEach: (TState, TItem) => TState
    close: TState => TResult
● Functional pattern - fold over the state
○ With an additional call at the end
● Akka-streams lacks a Sink of that type
● Calls open lazily, on arrival of the first element of the stream
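Outside of any streaming library, the open / onEach / close pattern is a left fold followed by one extra call. The helper below is a plain-Scala sketch of that idea over an Iterator; the name foldResource is borrowed from the slides, and unlike the sink described above this version calls open eagerly.

```scala
object FoldResourcePattern {
  // Fold over the state, then turn the final state into a result.
  def foldResource[TState, TItem, TResult](
      open: () => TState,
      onEach: (TState, TItem) => TState,
      close: TState => TResult
  )(items: Iterator[TItem]): TResult = {
    val finalState = items.foldLeft(open())(onEach)
    close(finalState)
  }

  // Example: collect strings into a Vector and join them on close.
  def joinAll(items: Iterator[String]): String =
    foldResource[Vector[String], String, String](
      () => Vector.empty,
      (acc, item) => acc :+ item,
      acc => acc.mkString("|")
    )(items)
}
```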
Let's create a new sink!
    class FoldResourceSink[TState, TItem, Mat](
      open: () => TState,
      onEach: (TState, TItem) => (TState),
      close: TState => Mat
    ) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] { … }
Methods to write:
    def onPush(): Unit
    override def preStart(): Unit
    override def onUpstreamFinish(): Unit
    override def onUpstreamFailure(ex: Throwable): Unit
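The four methods listed above could be filled in roughly as follows. This is a sketch written against Akka's GraphStage API (assuming Akka 2.6 or later), not the code from gfc-aws-s3: the class name, the error handling, and the choice to call open at completion when the stream was empty are all assumptions made for illustration.

```scala
import akka.actor.ActorSystem
import akka.stream.{Attributes, Inlet, SinkShape}
import akka.stream.scaladsl.Source
import akka.stream.stage.{GraphStageLogic, GraphStageWithMaterializedValue, InHandler}
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._
import scala.util.control.NonFatal

class FoldResourceSinkSketch[TState, TItem, Mat](
    open: () => TState,
    onEach: (TState, TItem) => TState,
    close: TState => Mat
) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] {

  private val in: Inlet[TItem] = Inlet("FoldResourceSinkSketch.in")
  override val shape: SinkShape[TItem] = SinkShape(in)

  override def createLogicAndMaterializedValue(
      attrs: Attributes): (GraphStageLogic, Future[Mat]) = {
    val promise = Promise[Mat]()
    val logic = new GraphStageLogic(shape) with InHandler {
      // None until the first element arrives, so open() is called lazily.
      private var state: Option[TState] = None

      // A sink has to request the first element itself.
      override def preStart(): Unit = pull(in)

      override def onPush(): Unit =
        try {
          val current = state.getOrElse(open())
          state = Some(onEach(current, grab(in)))
          pull(in)
        } catch {
          case NonFatal(ex) =>
            promise.tryFailure(ex)
            failStage(ex)
        }

      // Upstream is done: close the resource and publish the result.
      override def onUpstreamFinish(): Unit = {
        try promise.trySuccess(close(state.getOrElse(open())))
        catch { case NonFatal(ex) => promise.tryFailure(ex) }
        completeStage()
      }

      // Upstream failed: close is never called, the future just fails.
      override def onUpstreamFailure(ex: Throwable): Unit = {
        promise.tryFailure(ex)
        failStage(ex)
      }

      setHandler(in, this)
    }
    (logic, promise.future)
  }
}

object FoldResourceSinkDemo {
  // Sums 1..4 through the sink and renders the result on close.
  def sum(): String = {
    implicit val system: ActorSystem = ActorSystem("fold-resource-sketch")
    try {
      val sink = new FoldResourceSinkSketch[Int, Int, String](
        () => 0, (acc, x) => acc + x, total => s"sum=$total")
      Await.result(Source(1 to 4).runWith(sink), 5.seconds)
    } finally system.terminate()
  }
}
```

The result is exposed as the materialized Future, which is why the stage extends GraphStageWithMaterializedValue rather than a plain GraphStage.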
S3Sink from FoldResourceSink
● SinkA = Flow to SinkB
● S3 upload sink = S3 upload flow to FoldResourceSink
What is TState and TItem?
We need to keep track of the uploadId, the etags, and the total length uploaded so far
    case class S3MultipartUploaderState(
      uploadId: String,
      etags: List[PartETag],
      totalLength: Long
    )
And the item is:
    (ByteString, Int) // (content, chunkNumber)
FoldResourceSink for S3
    val sink =
      Sink.foldResource[S3MultipartUploaderState, (ByteString, Int), Long](
        () => initUpload(), // Returns state
        { case (state, (chunk, chunkNumber)) =>
          uploadChunk(state, chunk, chunkNumber) },
        completeUpload // Accepts state
      )
    Flow[Byte]
      .grouped(chunkSize)
      .map(b => ByteString(b:_*))
      .zip(
        Source.fromIterator(() => Iterator.from(1)) // pairs (content, partNumber)
      ).toMat(sink)(Keep.right)
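The chunking half of this pipeline can be tried on its own, without S3. The sketch below (assuming Akka 2.6 or later, with illustrative names) groups bytes into chunks of a tiny size and pairs each chunk with a part number starting at 1. With real S3 the chunk size would need to be much larger, since every part except the last must be at least 5 MB.

```scala
import java.nio.charset.StandardCharsets
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._

object ChunkingSketch {
  // Groups single bytes into ByteStrings and numbers them 1, 2, 3, ...
  def numberedChunks(chunkSize: Int): Flow[Byte, (ByteString, Int), NotUsed] =
    Flow[Byte]
      .grouped(chunkSize)
      .map(bytes => ByteString(bytes: _*))
      .zip(Source.fromIterator(() => Iterator.from(1)))

  // Chunks "hello world" by 4 bytes and returns (chunk text, part number) pairs.
  def demo(): Seq[(String, Int)] = {
    implicit val system: ActorSystem = ActorSystem("chunking-sketch")
    try {
      val input = "hello world".getBytes(StandardCharsets.UTF_8).toList
      val chunks = Await.result(
        Source(input).via(numberedChunks(4)).runWith(Sink.seq[(ByteString, Int)]),
        5.seconds)
      chunks.map { case (bytes, partNumber) => (bytes.utf8String, partNumber) }
    } finally system.terminate()
  }
}
```

The numbering source is infinite; zip completes as soon as the chunk side completes, so the last (possibly shorter) chunk still gets its number.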
SQL Source
Anorm provides an akka-stream SQL source
    libraryDependencies ++= Seq(
      "com.typesafe.play" %% "anorm-akka" % "version",
      "com.typesafe.akka" %% "akka-stream" % "version")
    AkkaStream.source(SQL"SELECT * FROM Test",
      SqlParser.scalar[String], ColumnAliaser.empty): Source[String]
Brings minimal transitive dependencies (!)
Road to production
● Retries in case of S3 errors/failures
○ S3 client handles this
● Handle possible problems during stream execution (e.g. a failure talking to the DB)
○ When the stream fails, it never calls complete
Could we do it the other way round?
● S3 tends to time out and drop the connection on slow downloads of large files
● Ability to process data in a streaming manner
S3 protocol for partial downloads
● By parts (see multipart upload)
○ Uses part numbers
○ Doesn’t work when the upload wasn’t multipart
○ Amazon says it’s faster
● By chunks
○ Chunk is defined by (from, to) byte numbers
○ Works for any file, and any chunk length
○ Amazon says it’s slow
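For the chunk-based variant, the (from, to) pairs can be computed up front from the file length. The helper below is a plain-Scala sketch with made-up names; it produces inclusive byte offsets, matching the convention of the HTTP Range header (bytes=from-to).

```scala
object ChunkRanges {
  // Inclusive (from, to) byte offsets covering a file of `length` bytes.
  def ranges(length: Long, chunkSize: Long): Vector[(Long, Long)] = {
    require(length >= 0, "length must not be negative")
    require(chunkSize > 0, "chunkSize must be positive")
    (0L until length by chunkSize).map { from =>
      // The last chunk may be shorter than chunkSize.
      (from, math.min(from + chunkSize, length) - 1)
    }.toVector
  }
}
```

For a 10-byte file and 4-byte chunks this yields three ranges, the last one covering only two bytes.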
Basic idea
By parts:
1. Get the part count
2. For each part, create an akka source
3. Combine the individual streams into one
By chunks:
1. Get the file length
2. For each chunk in the file, create an akka source
3. Combine the individual streams into one
Create an akka source from an IO stream:
    val stream: InputStream = …
    Source.fromInputStream(stream)
Downloading by parts
    Source.single(
      getPartCount(s3Client, bucketName, key)
    ).flatMapConcat { partCount =>
      Source(
        Range(firstPartIndex, partCount + firstPartIndex)
      )
    }.flatMapConcat { partNumber =>
      Source.fromInputStream(
        getS3ObjectContent(partNumber, readMemoryBufferSize)
      )
    } // Type - Source[ByteString, NotUsed]
Downloading by parts
    Source.single(())
      .map(
        // evaluated only when the stream runs, not when it is declared
        _ => getPartCount(s3Client, bucketName, key)
      ).flatMapConcat { partCount =>
        Source(
          Range(firstPartIndex, partCount + firstPartIndex)
        )
      }.flatMapConcat { partNumber =>
        Source.fromInputStream(
          getS3ObjectContent(partNumber, readMemoryBufferSize)
        )
      } // Type - Source[ByteString, NotUsed]
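The same shape works for downloading by chunks instead of parts. The sketch below (assuming Akka 2.6 or later) replaces S3 with an in-memory byte array so it can run anywhere; getChunk is a hypothetical stand-in for a ranged S3 GET, and the object and parameter names are made up.

```scala
import java.nio.charset.StandardCharsets
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._

object ChunkedDownloadSketch {
  // Hypothetical stand-in for a ranged GET: bytes from..to, both inclusive.
  def getChunk(data: Array[Byte], from: Int, to: Int): ByteString =
    ByteString(data.slice(from, to + 1))

  def download(data: Array[Byte], chunkSize: Int): Source[ByteString, NotUsed] =
    Source.single(())
      .map(_ => data.length) // 1. get the file length, lazily
      .flatMapConcat { length =>
        Source(Range(0, length, chunkSize)) // 2. one start offset per chunk
      }
      .flatMapConcat { from =>
        // 3. one small source per chunk, concatenated in order
        val to = math.min(from + chunkSize, data.length) - 1
        Source.single(from).map(f => getChunk(data, f, to))
      }

  // Downloads the text in 4-byte chunks and returns (chunk count, reassembled text).
  def roundTrip(text: String): (Int, String) = {
    implicit val system: ActorSystem = ActorSystem("chunked-download-sketch")
    try {
      val data = text.getBytes(StandardCharsets.UTF_8)
      val result = download(data, 4)
        .runFold((0, ByteString.empty)) { case ((n, acc), chunk) => (n + 1, acc ++ chunk) }
      val (count, bytes) = Await.result(result, 5.seconds)
      (count, bytes.utf8String)
    } finally system.terminate()
  }
}
```

Because flatMapConcat consumes one inner source at a time, only a single chunk request is in flight at any moment and the chunks arrive in order.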
gfc-aws-s3 https://github.com/gilt/gfc-aws-s3
Open-source project containing the code above (Sources and Sink)
Also s3-http as an educational example
Codez!
200 OK

Using akka-streams to access S3 objects

  • 1. Using akka-streams to access S3 objects Mikhail Girkin Software Engineer GILT HBC Digital @mike_girkin
  • 3. Initial problem
    ● Several big (hundreds of MB) database result sets
    ● All data cached in memory
    ● Served as JSON files
    ● The service was constantly OOM-ing, even on a 32 GB instance
  • 4. Akka-streams
    ● Library from the akka toolbox
    ● Built on top of the actor framework
    ● Handles streams and their specifics without exposing the actors themselves
  • 5. What is a “stream”?
    ● A sequence of objects
    ● Has an input
    ● Has an output
    ● Defined as a sequence of data transformations
    ● Could be infinite
    ● Steps could be executed independently
  • 6. Stream input - Source
    ● The entry point of the data into the stream
    ● Has an output channel to feed data into the stream
    Example here: SQLSource
  • 7. Stream output - Sink
    ● The final point of the data in the stream
    ● Has an input channel to receive the data from the stream
    Example here: S3 object
  • 8. Processing - Flow
    ● The transformation step of the stream
    ● Takes data from the input, applies some computation to it, and passes the resulting data to the output
    Example here: serialization
  • 9. Basic stream operations
    ● via
        Source via Flow => Source
        Flow via Flow => Flow
    ● to
        Flow to Sink => Sink
        Source to Sink => Stream (a runnable graph)
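A minimal sketch of the four combinations above, with the resulting types written out. It assumes akka-stream 2.6 or later on the classpath, where an implicit ActorSystem supplies the materializer. The `StreamOps` object and all value names are illustrative and not taken from the talk.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Keep, RunnableGraph, Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object StreamOps {
  // Building blocks (illustrative names)
  val numbers: Source[Int, NotUsed] = Source(1 to 5)
  val plusTen: Flow[Int, Int, NotUsed] = Flow[Int].map(_ + 10)
  val asText: Flow[Int, String, NotUsed] = Flow[Int].map(_.toString)
  val collect: Sink[String, Future[Seq[String]]] = Sink.seq[String]

  // Source via Flow => Source
  val shifted: Source[Int, NotUsed] = numbers.via(plusTen)
  // Flow via Flow => Flow
  val combined: Flow[Int, String, NotUsed] = plusTen.via(asText)
  // Flow to Sink => Sink (toMat + Keep.right keeps the sink's Future)
  val textSink: Sink[Int, Future[Seq[String]]] = asText.toMat(collect)(Keep.right)
  // Source to Sink => runnable graph; nothing has executed yet
  val graph: RunnableGraph[Future[Seq[String]]] = shifted.toMat(textSink)(Keep.right)

  // Runs the graph once on a throwaway ActorSystem and waits for the result
  def runOnce(): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("stream-ops")
    try Await.result(graph.run(), 10.seconds)
    finally system.terminate()
  }
}
```

Using `toMat(...)(Keep.right)` instead of plain `to` is what makes the sink's result available to the caller; plain `to` keeps the materialized value of the left-hand side.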
  • 10. Declaration is not execution!
    A stream description is just a declaration, so:
      val s = Source[Int](Range(1, 100).toList)
        .via(
          Flow[Int].map(x => x + 10)
        ).to(
          Sink.foreach(println)
        )
    will not execute until you call
      s.run()
  • 11. The skeleton
    Get data -> serialize -> send to S3
      def run(): Future[Long] = {
        val cn = getConnection()
        val stream = (cn: Connection) =>
          dataSource.streamList(cn)             // Source[Item] - get data from the DB
            .via(serializeFlow)                 // Flow[Item, Byte] - serialize
            .toMat(s3UploaderSink)(Keep.right)  // Sink[Byte] - upload to S3
        val countFuture = stream(cn).run()
        countFuture.onComplete { r =>
          cn.close()
        }
        countFuture
      }
  • 12. Serialize in the stream
    ● We are dealing with a single collection
    ● All items have the same type
      val serializeFlow = Flow[Item]
        .map(x => serializeItem(x))   // serializeItem: Item => String
        .intersperse("[", ",", "]")   // sort of mkString for streams
        .mapConcat[Byte] {            // mapConcat ≈ flatMap
          x => x.getBytes("UTF-8").toIndexedSeq  // explicit charset instead of the platform default
        }
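To see what the serialize flow emits, here is a small runnable sketch of the same shape. `Item` and `serializeItem` are hypothetical stand-ins (the talk does not show them), the hand-rolled JSON does no escaping, and akka-stream 2.6 or later is assumed.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import java.nio.charset.StandardCharsets.UTF_8
import scala.concurrent.Await
import scala.concurrent.duration._

object SerializeSketch {
  // Hypothetical item type, only for illustration
  final case class Item(id: Int, name: String)

  // Hypothetical serializer: naive JSON without any escaping
  def serializeItem(item: Item): String =
    s"""{"id":${item.id},"name":"${item.name}"}"""

  // Same structure as the flow on the slide
  val serializeFlow: Flow[Item, Byte, NotUsed] = Flow[Item]
    .map(serializeItem)
    .intersperse("[", ",", "]")                            // "[" first, "," between items, "]" last
    .mapConcat(text => text.getBytes(UTF_8).toIndexedSeq)  // one stream element per byte

  // Runs the flow over a small list and decodes the collected bytes
  def render(items: List[Item]): String = {
    implicit val system: ActorSystem = ActorSystem("serialize-sketch")
    try {
      val bytes = Await.result(
        Source(items).via(serializeFlow).runWith(Sink.seq[Byte]),
        10.seconds)
      new String(bytes.toArray, UTF_8)
    } finally system.terminate()
  }
}
```

Collecting everything into memory in `render` is only for inspection; in the real pipeline the bytes go straight to the upload sink.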
  • 13. S3 multipart upload API
    ● Allows uploading a file in separate chunks
    ● Allows uploading chunks in parallel
    ● (!) By default, uploaded chunks have no TTL
    Simplified API:
      1. initialize(bucket, filename) => uploadId
      2. uploadChunk(uploadId, partNumber, content) => etag
      3. complete(uploadId, List[etag])
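The three-step protocol can be sketched as a plain Scala trait with an in-memory fake, so the call order is easy to follow. This is not the AWS SDK: `MultipartApi`, `InMemoryMultipart` and the generated ids and etags are all made up for illustration.

```scala
import scala.collection.mutable

object MultipartSketch {
  // The simplified API from the slide, as a trait (hypothetical, not the AWS SDK)
  trait MultipartApi {
    def initialize(bucket: String, filename: String): String                         // => uploadId
    def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String // => etag
    def complete(uploadId: String, etags: List[String]): Unit
  }

  // In-memory fake that assembles the parts on complete()
  final class InMemoryMultipart extends MultipartApi {
    private val parts = mutable.Map.empty[String, mutable.Map[Int, Array[Byte]]]
    private val targets = mutable.Map.empty[String, String]
    val objects = mutable.Map.empty[String, Array[Byte]]

    def initialize(bucket: String, filename: String): String = {
      val uploadId = s"upload-${targets.size + 1}"
      targets(uploadId) = s"$bucket/$filename"
      parts(uploadId) = mutable.Map.empty
      uploadId
    }

    def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String = {
      parts(uploadId)(partNumber) = content
      s"etag-$partNumber"
    }

    def complete(uploadId: String, etags: List[String]): Unit = {
      val ordered = parts(uploadId).toSeq.sortBy(_._1)
      require(etags == ordered.map { case (n, _) => s"etag-$n" }.toList,
        "etag list does not match the uploaded parts")
      objects(targets(uploadId)) = ordered.flatMap { case (_, bytes) => bytes.toSeq }.toArray
      parts.remove(uploadId) // until complete() is called, the parts just sit there
    }
  }

  // initialize -> uploadChunk for every chunk -> complete; part numbers start at 1
  def upload(api: MultipartApi, bucket: String, filename: String, chunks: List[Array[Byte]]): Unit = {
    val uploadId = api.initialize(bucket, filename)
    val etags = chunks.zipWithIndex.map { case (chunk, i) => api.uploadChunk(uploadId, i + 1, chunk) }
    api.complete(uploadId, etags)
  }
}
```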
  • 14. Resource access
    ● Pattern: Open - Do stuff - Close
        open: () => TState
        onEach: (TState, TItem) => (TState)
        close: TState => TResult
    ● Functional pattern - fold over the state
      ○ With an additional call at the end
    ● Akka-streams lacks a Sink of that type
    ● Calls open lazily, on arrival of the first element of the stream
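The open / onEach / close pattern, sketched over a plain Iterator instead of a stream. `ResourcePattern.foldResource` is an illustrative helper, not an akka or gfc-aws-s3 function. One assumption is labeled in the code: for an empty input, open is still called once so that close has a state to work with.

```scala
object ResourcePattern {
  // Fold over a state that is opened lazily and closed at the end
  def foldResource[S, T, R](items: Iterator[T])(
      open: () => S,
      onEach: (S, T) => S,
      close: S => R): R = {
    var state: Option[S] = None
    items.foreach { item =>
      // open() runs only when the first element arrives
      val current = state.getOrElse(open())
      state = Some(onEach(current, item))
    }
    // Assumption: an empty input still opens once, so close() always gets a state
    close(state.getOrElse(open()))
  }
}
```

Usage: `ResourcePattern.foldResource[Int, Int, String](Iterator(1, 2, 3))(() => 0, _ + _, s => s"sum=$s")` folds the three numbers into one state and hands the final state to close.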
  • 15. Let's create a new sink!
      class FoldResourceSink[TState, TItem, Mat](
        open: () => TState,
        onEach: (TState, TItem) => (TState),
        close: TState => Mat
      ) extends GraphStageWithMaterializedValue[SinkShape[TItem], Future[Mat]] { … }
    Methods to write:
      def onPush(): Unit
      override def preStart(): Unit
      override def onUpstreamFinish(): Unit
      override def onUpstreamFailure(ex: Throwable): Unit
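One possible shape for such a stage, written as a sketch and not as the code from gfc-aws-s3. The class name `FoldResourceSinkSketch` is made up, and two behaviours are assumptions: open is also called when the stream finishes without any element, and close is not called when the upstream fails.

```scala
import akka.stream.{Attributes, Inlet, SinkShape}
import akka.stream.stage.{GraphStageLogic, GraphStageWithMaterializedValue, InHandler}
import scala.concurrent.{Future, Promise}
import scala.util.control.NonFatal

final class FoldResourceSinkSketch[S, T, R](
    open: () => S,
    onEach: (S, T) => S,
    close: S => R
) extends GraphStageWithMaterializedValue[SinkShape[T], Future[R]] {

  val in: Inlet[T] = Inlet("FoldResourceSinkSketch.in")
  override val shape: SinkShape[T] = SinkShape(in)

  override def createLogicAndMaterializedValue(attrs: Attributes): (GraphStageLogic, Future[R]) = {
    val result = Promise[R]()
    val logic = new GraphStageLogic(shape) with InHandler {
      private var state: Option[S] = None

      // A sink has to request the first element itself
      override def preStart(): Unit = pull(in)

      override def onPush(): Unit =
        try {
          val current = state.getOrElse(open()) // lazy open on the first element
          state = Some(onEach(current, grab(in)))
          pull(in)
        } catch {
          case NonFatal(e) => result.tryFailure(e); failStage(e)
        }

      override def onUpstreamFinish(): Unit =
        try {
          // Assumption: an empty stream still opens once so close() gets a state
          result.trySuccess(close(state.getOrElse(open())))
          completeStage()
        } catch {
          case NonFatal(e) => result.tryFailure(e); failStage(e)
        }

      // Assumption: on upstream failure the resource is not closed
      override def onUpstreamFailure(ex: Throwable): Unit = {
        result.tryFailure(ex)
        failStage(ex)
      }

      setHandler(in, this)
    }
    (logic, result.future)
  }
}
```

It can be turned into a regular sink with `Sink.fromGraph(new FoldResourceSinkSketch(...))`. Note that open, onEach and close run on the stream's own dispatcher here, so blocking calls inside them would hold up that thread.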
  • 16. S3Sink from FoldResourceSink
    ● SinkA = Flow to SinkB
    ● S3 upload sink = S3 upload flow to FoldResourceSink
  • 17. What are TState and TItem?
    We need to keep track of the uploadId, the etags and the length uploaded so far:
      case class S3MultipartUploaderState(
        uploadId: String,
        etags: List[PartETag],
        totalLength: Long
      )
    And the item is:
      (ByteString, Int) // (content, chunkNumber)
  • 18. FoldResourceSink for S3
      val sink = Sink.foldResource[S3MultipartUploaderState, (ByteString, Int), Long](
        () => initUpload(), // returns the initial state
        { case (state, (chunk, chunkNumber)) => uploadChunk(state, chunk, chunkNumber) },
        completeUpload // accepts the final state
      )

      Flow[Byte]
        .grouped(chunkSize)
        .map(b => ByteString(b:_*))
        .zip(
          Source.fromIterator(() => Iterator.from(1)) // pairs (content, partNumber)
        ).toMat(sink)(Keep.right)
  • 19. SQL Source
    Anorm provides an akka-stream SQL source:
      libraryDependencies ++= Seq(
        "com.typesafe.play" %% "anorm-akka" % "version",
        "com.typesafe.akka" %% "akka-stream" % "version")

      AkkaStream.source(SQL"SELECT * FROM Test", SqlParser.scalar[String], ColumnAliaser.empty): Source[String]
    Brings minimal transitive dependencies (!)
  • 20. Road to production
    ● Retries in case of S3 errors/failures
      ○ The S3 client handles this
    ● Handle possible problems during stream execution (e.g. a failure talking to the DB)
      ○ When the stream fails, it never calls complete
  • 21. Could we do it the other way round?
    ● S3 tends to time out and drop the connection on slow downloads of large files
    ● Ability to process data in a streaming manner
  • 22. S3 protocol for partial downloads
    ● By parts (see multipart upload)
      ○ Uses part numbers
      ○ Doesn’t work when the upload wasn’t multipart
      ○ Amazon says it’s faster
    ● By chunks
      ○ A chunk is defined by (from, to) byte numbers
      ○ Works for any file and any chunk length
      ○ Amazon says it’s slow
  • 23. Basic idea
    By parts:
      1. Get the part count
      2. For each part, create an akka source
      3. Combine the individual streams into one
    By chunks:
      1. Get the file length
      2. For each chunk in the file, create an akka source
      3. Combine the individual streams into one
    Create an akka source from an IO stream:
      val stream: InputStream = …
      StreamConverters.fromInputStream(() => stream)
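For the by-chunks variant, the (from, to) pairs can be computed up front from the file length. A small sketch in plain Scala; `ChunkRanges` is an illustrative name, and both bounds are inclusive, as in HTTP byte ranges.

```scala
object ChunkRanges {
  // Splits [0, fileLength) into consecutive (from, to) ranges, both bounds inclusive
  def ranges(fileLength: Long, chunkSize: Long): List[(Long, Long)] = {
    require(fileLength >= 0, "fileLength must not be negative")
    require(chunkSize > 0, "chunkSize must be positive")
    (0L until fileLength by chunkSize).map { from =>
      // The last chunk may be shorter than chunkSize
      (from, math.min(from + chunkSize, fileLength) - 1)
    }.toList
  }
}
```

For a 10-byte file and a chunk size of 4 this yields (0, 3), (4, 7) and (8, 9); each pair would then back one source, and the sources are concatenated in order.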
  • 24. Downloading by parts
      Source.single(
        getPartCount(s3Client, bucketName, key)
      ).flatMapConcat { partCount =>
        Source(
          Range(firstPartIndex, partCount + firstPartIndex)
        )
      }.flatMapConcat { partNumber =>
        StreamConverters.fromInputStream(
          () => getS3ObjectContent(partNumber, readMemoryBufferSize)
        )
      } // Type - Source[ByteString, NotUsed]
  • 25. Downloading by parts
      Source.single(())
        .map( _ =>
          // getPartCount now runs when the stream runs, not when it is declared
          getPartCount(s3Client, bucketName, key)
        ).flatMapConcat { partCount =>
          Source(
            Range(firstPartIndex, partCount + firstPartIndex)
          )
        }.flatMapConcat { partNumber =>
          StreamConverters.fromInputStream(
            () => getS3ObjectContent(partNumber, readMemoryBufferSize)
          )
        } // Type - Source[ByteString, NotUsed]
  • 26. gfc-aws-s3
    https://github.com/gilt/gfc-aws-s3
    Open-source project containing the code above (Sources and Sink)
    Also s3-http as an educational example
    Codez!