Streaming data to S3 using Akka Streams


This document summarizes using Akka Streams to stream large database result sets to Amazon S3. The key points are:
- Akka Streams can handle streaming large amounts of data without overloading memory by processing data in chunks.
- A stream consists of a source (database query), flow (serialization), and sink (S3 upload).
- The stream serializes database rows into bytes and uploads them to S3 in parallel chunks using S3's multipart upload API to avoid timeouts.
- Anorm provides an Akka Streams source to query a database, and a custom S3 sink uploads chunks to S3 concurrently. Retries and error handling would be needed for production.


The problem
● Several big (hundreds of MB) database result sets
● Served as JSON files
● The service was constantly running out of memory (OOM), even on a 32 GB instance

Akka-streams
● A library from the Akka toolbox
● Built on top of the actor framework
● Handles streams and their specifics without exposing the actors themselves

A bit on akka-streams - Source
● The entry point of the data into the stream
● Has an output channel that feeds data into the stream
● In this talk: SQLSource

A bit on akka streams - Sink
● The final destination of the data in the stream
● Has an input channel that receives data from the stream
● In this talk: the S3 object

Another bit on akka-streams - Flow
● The transformation step of the stream
● Takes data from its input, applies some computation to it, and passes the resulting data to its output
● In this talk: serialization

Basic stream operations
● via
Source via Flow => Source
Flow via Flow => Flow
● to
Flow to Sink => Sink
Source to Sink => RunnableGraph (a closed stream, ready to run)
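The combinations above can be checked against the compiler. The following is a minimal sketch, assuming Akka 2.6+ (akka-stream) and Scala 2.13 on the classpath; the names `numbers`, `double` and `render` are made up for illustration. Each `val` carries an explicit type annotation so the result of every `via` and `to` is visible. Connecting a Source to a Sink closes the graph, which Akka types as a RunnableGraph.

```scala
import akka.NotUsed
import akka.stream.scaladsl.{Flow, RunnableGraph, Sink, Source}

object Composition {
  // Building blocks (hypothetical names, for illustration only)
  val numbers: Source[Int, NotUsed] = Source(1 to 5)
  val double: Flow[Int, Int, NotUsed] = Flow[Int].map(_ * 2)
  val render: Flow[Int, String, NotUsed] = Flow[Int].map(n => s"value=$n")

  // Source via Flow => Source
  val doubled: Source[Int, NotUsed] = numbers.via(double)

  // Flow via Flow => Flow
  val doubleAndRender: Flow[Int, String, NotUsed] = double.via(render)

  // Flow to Sink => Sink (the sink now accepts the flow's input type)
  val renderAndPrint: Sink[Int, NotUsed] =
    render.to(Sink.foreach[String](s => println(s)))

  // Source to Sink => RunnableGraph: no open inputs or outputs remain
  val graph: RunnableGraph[NotUsed] = doubled.to(renderAndPrint)
}
```

Nothing in this object runs a stream; it only builds descriptions that type-check.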

Declaration is not execution!
Stream description is just a declaration, so:
    val s = Source[Int](Range(1, 100).toList)
      .via(
        Flow[Int].map(x => x + 10)
      ).to(
        Sink.foreach(println)
      )
will not execute until you call
    s.run()
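To make the distinction concrete, here is a small sketch, again assuming Akka 2.6+ and Scala 2.13, where the implicit ActorSystem supplies the materializer. It swaps the printing sink for `Sink.fold` so the run produces a value that can be inspected; `blueprint` is a hypothetical name.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Keep, RunnableGraph, Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object DeclarationVsExecution {
  // Only a description: building this value processes no elements.
  val blueprint: RunnableGraph[Future[Int]] =
    Source(1 to 99)
      .map(_ + 10)
      .toMat(Sink.fold[Int, Int](0)(_ + _))(Keep.right) // keep the sink's Future

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("demo")
    // run() materializes the stream; the Future completes with the fold result.
    val sum = Await.result(blueprint.run(), 5.seconds)
    println(sum) // prints 5940
    system.terminate()
  }
}
```

The same blueprint can be run more than once; each `run()` materializes an independent stream.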

The skeleton
Get data -> serialize -> send to S3
    def run(): Future[Long] = {
      val cn = getConnection()
      val stream = (cn: Connection) =>
        dataSource.streamList(cn)            // Source[Item] - get data from the DB
          .via(serializeFlow)                // Flow[Item, Byte] - serialize
          .toMat(s3UploaderSink)(Keep.right) // Sink[Byte] - upload to S3
      val countFuture = stream(cn).run()
      // Close the connection whether the stream succeeded or failed
      countFuture.onComplete { _ =>
        cn.close()
      }
      countFuture
    }

Serialize in the stream
● We deal with a single collection
● All items have the same type
    val serializeFlow = Flow[Item]
      .map(x => serializeItem(x))   // serializeItem: Item => String
      .intersperse("[", ",", "]")   // sort of mkString for streams
      .mapConcat[Byte] {
        x => x.getBytes("UTF-8").toIndexedSeq // explicit charset instead of the platform default
      }
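The effect of `intersperse` is easiest to see on a tiny input. The sketch below assumes Akka 2.6+ and Scala 2.13; `Item` and `serializeItem` are hypothetical stand-ins (the toy serializer does no JSON escaping). As a variation on the flow above, it emits `ByteString` elements rather than single bytes, which is an assumption of this example and not what the slide does.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.Await
import scala.concurrent.duration._

object SerializeDemo {
  // Hypothetical item type and toy serializer (no escaping), for illustration only.
  final case class Item(id: Int, name: String)
  def serializeItem(i: Item): String = s"""{"id":${i.id},"name":"${i.name}"}"""

  // Streams the items through the same map + intersperse steps and collects the bytes.
  def render(items: List[Item])(implicit system: ActorSystem): String = {
    val bytes = Source(items)
      .map(serializeItem)
      .intersperse("[", ",", "]")  // start, separator, end
      .map(s => ByteString(s))     // ByteString(String) encodes as UTF-8
      .runWith(Sink.fold[ByteString, ByteString](ByteString.empty)(_ ++ _))
    Await.result(bytes, 5.seconds).utf8String
  }
}
```

For example, `render(List(Item(1, "a"), Item(2, "b")))` yields `[{"id":1,"name":"a"},{"id":2,"name":"b"}]`. Collecting everything into one `ByteString` is only for the demo; the real pipeline hands the bytes to the S3 sink.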

S3 multipart upload API
● Allows uploading a file in separate chunks
● Allows uploading the chunks in parallel
● Uploaded chunks have no TTL (by default)
Simplified methods:
1. initialize(bucket, filename) => uploadId
2. uploadChunk(uploadId, partNumber, content) => hashSum
3. complete()
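The three simplified calls can be modelled without any AWS dependency. The following sketch uses only the Scala 2.13 standard library; `MultipartApi` and `InMemoryS3` are hypothetical names and not part of any SDK. It shows why part numbers matter: chunks may arrive in any order and are assembled by number on completion. In the real S3 API, completing an upload also requires the list of part numbers with the ETags returned by each part upload, and every part except the last must be at least 5 MiB.

```scala
import java.security.MessageDigest
import scala.collection.mutable

object MultipartSketch {
  // Hypothetical interface mirroring the simplified methods above.
  trait MultipartApi {
    def initialize(bucket: String, filename: String): String // => uploadId
    def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String // => hashSum
    def complete(uploadId: String): Array[Byte] // the fake returns the assembled object
  }

  // In-memory stand-in for S3, for illustration and tests only (not thread-safe).
  final class InMemoryS3 extends MultipartApi {
    private val uploads = mutable.Map.empty[String, mutable.Map[Int, Array[Byte]]]
    private var counter = 0

    def initialize(bucket: String, filename: String): String = {
      counter += 1
      val id = s"$bucket/$filename#$counter"
      uploads(id) = mutable.Map.empty
      id
    }

    def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String = {
      uploads(uploadId)(partNumber) = content
      // MD5 hex digest as the chunk's hash sum
      MessageDigest.getInstance("MD5").digest(content).map("%02x".format(_)).mkString
    }

    def complete(uploadId: String): Array[Byte] = {
      val parts = uploads.remove(uploadId).getOrElse(mutable.Map.empty[Int, Array[Byte]])
      // Assemble by part number, regardless of upload order
      parts.toSeq.sortBy(_._1).map(_._2).foldLeft(Array.emptyByteArray)(_ ++ _)
    }
  }
}
```

Uploading part 2 before part 1 and then calling `complete` still produces the bytes in part-number order.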

Let's create an S3 Sink!
● SinkA = Flow to SinkB
● S3 upload sink = S3 upload flow to Sink.head (first value received)

S3 upload sink
    Flow[Byte]
      .grouped(chunkSize)                                // split the stream into chunks
      .zip(Source.fromIterator(() => Iterator.from(1)))  // give the chunks numbers
      .fold[MultipartUploader](                          // fold over uploader state
        initUploader()                                   // initial value - uploader
      ) {
        case (uploader, (data, chunkNumber)) =>          // reduce - returns uploader (!)
          uploader.uploadChunk(chunkNumber, data.toArray)
      }.map {
        uploader => uploader.complete()                  // close the uploader on completion
      }
      // toMat with Keep.right exposes Sink.head's Future as the sink's materialized value,
      // which the skeleton's run() returns; a plain .to(Sink.head) would discard it
      .toMat(Sink.head)(Keep.right)
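The fold above threads a single uploader through the stream, so chunks are handed over one after another. One possible way to keep several uploads in flight is `mapAsync`, sketched below under the assumption of Akka 2.6+ and Scala 2.13. `FakeUploader` is a hypothetical in-memory stand-in; a real uploader would call the S3 multipart API, and `chunkSize` would have to respect S3's 5 MiB minimum part size.

```scala
import akka.stream.scaladsl.{Flow, Keep, Sink}
import akka.util.ByteString
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{ExecutionContext, Future}

object ParallelUploadSketch {
  // Hypothetical uploader that stores parts in memory instead of calling S3.
  final class FakeUploader {
    val parts = new ConcurrentHashMap[Long, ByteString]()
    def uploadChunk(partNumber: Long, data: ByteString)(
        implicit ec: ExecutionContext): Future[Long] =
      Future { parts.put(partNumber, data); data.length.toLong }
  }

  // Sink that uploads chunks with up to `parallelism` uploads in flight
  // and materializes the total number of bytes uploaded.
  def uploadSink(uploader: FakeUploader, chunkSize: Int, parallelism: Int)(
      implicit ec: ExecutionContext): Sink[Byte, Future[Long]] =
    Flow[Byte]
      .grouped(chunkSize)  // split the stream into chunks
      .zipWithIndex        // 0-based index alongside each chunk
      .mapAsync(parallelism) { case (chunk, index) =>
        uploader.uploadChunk(index + 1, ByteString(chunk.toArray)) // part numbers start at 1
      }
      .toMat(Sink.fold[Long, Long](0L)(_ + _))(Keep.right)
}
```

Running 25 bytes through `uploadSink(uploader, chunkSize = 10, parallelism = 4)` stores three parts and completes with 25. `mapAsync` emits results in upstream order and backpressures once `parallelism` uploads are pending, so memory use stays bounded by roughly `parallelism` chunks.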

SQL Source
Anorm provides an Akka Streams SQL source:
    libraryDependencies ++= Seq(
      "com.typesafe.play" %% "anorm-akka" % "version",
      "com.typesafe.akka" %% "akka-stream" % "version")

    AkkaStream.source(SQL"SELECT * FROM Test",
      SqlParser.scalar[String], ColumnAliaser.empty): Source[String]
It brings minimal transitive dependencies (!)

Road to production
● Retry on S3 errors and failures
● Handle problems that can occur while the stream is running (e.g. a failure talking to the DB)
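As one possible starting point for the first item, a chunk upload that returns a Future can be wrapped in a small retry helper. The sketch below uses only the Scala standard library; `retry` is a hypothetical helper, not an Akka or AWS API, and it retries immediately with no backoff, which a production version would likely want to add.

```scala
import scala.concurrent.{ExecutionContext, Future}
import scala.util.control.NonFatal

object RetrySketch {
  // Runs `op` until it succeeds or `attempts` tries in total have been made.
  // Future.unit.flatMap also captures exceptions thrown synchronously by `op`.
  def retry[T](attempts: Int)(op: () => Future[T])(
      implicit ec: ExecutionContext): Future[T] =
    Future.unit.flatMap(_ => op()).recoverWith {
      case NonFatal(_) if attempts > 1 => retry(attempts - 1)(op)
    }
}
```

For instance, `retry(3)(() => uploadPart(...))`, with `uploadPart` standing for whatever call uploads a chunk, would give each chunk up to three tries before the failure propagates and fails the stream. On such a failure the unfinished multipart upload should also be aborted, since uploaded chunks do not expire by default.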

From concept to deployment, Prosigns is your trusted partner for transforming ideas into reality and unlocking the full potential of your business. Join us on a journey of innovation and growth. Let's partner for success with Prosigns.

OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam

takuyayamamoto1800

Cracking the code review at SpringIO 2024

Paco van Beckhoven

Code reviews are vital for ensuring good code quality. They serve as one of our last lines of defense against bugs and subpar code reaching production. Yet, they often turn into annoying tasks riddled with frustration, hostility, unclear feedback and lack of standards. How can we improve this crucial process? In this session we will cover: - The Art of Effective Code Reviews - Streamlining the Review Process - Elevating Reviews with Automated Tools By the end of this presentation, you'll have the knowledge on how to organize and improve your code review proces

Enhancing Research Orchestration Capabilities at ORNL.pdf

Globus

Cross-facility research orchestration comes with ever-changing constraints regarding the availability and suitability of various compute and data resources. In short, a flexible data and processing fabric is needed to enable the dynamic redirection of data and compute tasks throughout the lifecycle of an experiment. In this talk, we illustrate how we easily leveraged Globus services to instrument the ACE research testbed at the Oak Ridge Leadership Computing Facility with flexible data and task orchestration capabilities.

Recently uploaded (20)

BoxLang: Review our Visionary Licenses of 2024

Webinar: Salesforce Document Management 2.0 - Smarter, Faster, Better

Gamify Your Mind; The Secret Sauce to Delivering Success, Continuously Improv...

Vitthal Shirke Java Microservices Resume.pdf

Beyond Event Sourcing - Embracing CRUD for Wix Platform - Java.IL

May Marketo Masterclass, London MUG May 22 2024.pdf

How Recreation Management Software Can Streamline Your Operations.pptx

Globus Compute wth IRI Workflows - GlobusWorld 2024

2024 RoOUG Security model for the cloud.pptx

Dominate Social Media with TubeTrivia AI’s Addictive Quiz Videos.pdf

Graphic Design Crash Course for beginners

Providing Globus Services to Users of JASMIN for Environmental Data Analysis

Exploring Innovations in Data Repository Solutions - Insights from the U.S. G...

Launch Your Streaming Platforms in Minutes

Large Language Models and the End of Programming

Globus Connect Server Deep Dive - GlobusWorld 2024

Prosigns: Transforming Business with Tailored Technology Solutions

OpenFOAM solver for Helmholtz equation, helmholtzFoam / helmholtzBubbleFoam

Cracking the code review at SpringIO 2024

Enhancing Research Orchestration Capabilities at ORNL.pdf

Streaming data to s3 using akka streams

1. Streaming data to S3 using akka-streams Mikhail Girkin Software Engineer GILT HBC Digital @mike_girkin

2. The problem
● Several big (hundreds of MB) database result sets
● Served as JSON files
● The service was constantly running out of memory (OOM), even on a 32 GB instance

3. Akka-streams
● A library from the Akka toolbox
● Built on top of the actor framework
● Handles streams and their specifics without exposing the actors themselves

4. A bit on akka-streams - Source
● The input of data into the stream
● Has an output channel to feed data into the stream
● Example here: SQLSource

5. A bit on akka-streams - Sink
● The final point of the data in the stream
● Has an input channel to receive data from the stream
● Example here: S3 object

6. Another bit on akka-streams - Flow
● The transformation step of the stream
● Takes data from its input, applies some computation to it, and passes the result to its output
● Example here: Serialization

7. Basic stream operations
● via
    Source via Flow => Source
    Flow via Flow => Flow
● to
    Flow to Sink => Sink
    Source to Sink => RunnableGraph (a closed, runnable stream)
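The composition rules on this slide can be checked by the compiler. The following is a minimal sketch, assuming Akka 2.6+ with `akka-stream` on the classpath; the object and value names are illustrative and not from the talk. Note that attaching a Sink to a Source closes the stream, so the result type is `RunnableGraph` rather than another Sink.

```scala
import akka.{Done, NotUsed}
import akka.stream.scaladsl.{Flow, RunnableGraph, Sink, Source}
import scala.concurrent.Future

object CompositionTypes {
  // Building blocks (illustrative names)
  val numbers: Source[Int, NotUsed] = Source(1 to 3)
  val addTen: Flow[Int, Int, NotUsed] = Flow[Int].map(_ + 10)
  val render: Flow[Int, String, NotUsed] = Flow[Int].map(_.toString)
  val printer: Sink[String, Future[Done]] = Sink.foreach[String](println)

  // Source via Flow => Source
  val shifted: Source[Int, NotUsed] = numbers.via(addTen)
  // Flow via Flow => Flow
  val pipeline: Flow[Int, String, NotUsed] = addTen.via(render)
  // Flow to Sink => Sink (accepting the Flow's input type)
  val intPrinter: Sink[Int, NotUsed] = render.to(printer)
  // Source to Sink => RunnableGraph; nothing runs until run() is called
  val graph: RunnableGraph[NotUsed] = shifted.to(intPrinter)
}
```

The explicit type annotations are there only to make the result of each combinator visible; in normal code they would usually be inferred.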

8. Declaration is not execution!
Stream description is just a declaration, so:
    val s = Source[Int](Range(1, 100).toList)
      .via(Flow[Int].map(x => x + 10))
      .to(Sink.foreach(println))
will not execute until you call
    s.run()
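One way to observe the difference between declaring and running is to count side effects before and after `run()`. This is a sketch under assumptions: Akka 2.6+, where an implicit `ActorSystem` is enough to materialize a stream (older versions need an explicit `ActorMaterializer`); the counter and the `demo` helper are hypothetical additions for illustration.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.Await
import scala.concurrent.duration._

object DeclarationVsExecution {
  // Returns (elements seen before run(), elements seen after the stream completed)
  def demo(): (Int, Int) = {
    implicit val system: ActorSystem = ActorSystem("declaration-demo")
    val seen = new AtomicInteger(0)
    try {
      // Only a blueprint: no element is produced at this point
      val graph = Source(1 until 100)
        .via(Flow[Int].map(_ + 10))
        .toMat(Sink.foreach[Int](_ => seen.incrementAndGet()))(Keep.right)

      val before = seen.get()
      // run() materializes the blueprint; the Future completes when the stream ends
      Await.result(graph.run(), 10.seconds)
      (before, seen.get())
    } finally {
      Await.result(system.terminate(), 10.seconds)
    }
  }
}
```

Because the graph is only a description, the same value can be run several times, and each run is an independent stream.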

9. The skeleton
Get data -> serialize -> send to S3
    def run(): Future[Long] = {
      val cn = getConnection()
      val stream = (cn: Connection) =>
        dataSource.streamList(cn)            // Source[Item] - get data from the DB
          .via(serializeFlow)                // Flow[Item, Byte] - serialize
          .toMat(s3UploaderSink)(Keep.right) // Sink[Byte] - upload to S3
      val countFuture = stream(cn).run()
      countFuture.onComplete { _ => cn.close() } // close the connection on success or failure
      countFuture
    }

10. Serialize in the stream
● We deal with a single collection
● All items have the same type
    val serializeFlow = Flow[Item]
      .map(x => serializeItem(x))   // serializeItem: Item => String
      .intersperse("[", ",", "]")   // sort of mkString for streams
      .mapConcat[Byte] { x => x.getBytes("UTF-8").toIndexedSeq } // explicit charset instead of the platform default
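To see what the serialization flow emits, it can be run over a small in-memory source and the bytes decoded back into a string. This is a sketch, assuming Akka 2.6+; `Item`, the hand-rolled `serializeItem` (which does no JSON escaping), and `render` are hypothetical stand-ins, since the talk does not show its own definitions.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}
import java.nio.charset.StandardCharsets.UTF_8
import scala.concurrent.Await
import scala.concurrent.duration._

object SerializeDemo {
  // Hypothetical row type; the real Item is not shown in the slides
  final case class Item(id: Int, name: String)

  // Naive serializer for illustration only (no escaping of special characters)
  def serializeItem(item: Item): String =
    s"""{"id":${item.id},"name":"${item.name}"}"""

  val serializeFlow = Flow[Item]
    .map(serializeItem)
    .intersperse("[", ",", "]")                      // wrap in brackets, separate with commas
    .mapConcat(s => s.getBytes(UTF_8).toIndexedSeq)  // String => individual bytes

  // Runs the flow over the given items and decodes the emitted bytes
  def render(items: List[Item]): String = {
    implicit val system: ActorSystem = ActorSystem("serialize-demo")
    try {
      val bytes = Await.result(Source(items).via(serializeFlow).runWith(Sink.seq), 10.seconds)
      new String(bytes.toArray, UTF_8)
    } finally {
      Await.result(system.terminate(), 10.seconds)
    }
  }
}
```

Collecting everything with `Sink.seq` defeats the purpose of streaming and is done here only to inspect the output; in the pipeline the bytes go straight to the upload sink.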

11. S3 multipart upload API
● Allows uploading a file in separate chunks
● Allows uploading chunks in parallel
● Uploaded chunks have no TTL (by default)
Simplified methods:
1. initialize(bucket, filename) => uploadId
2. uploadChunk(uploadId, partNumber, content) => hashSum
3. complete()
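The three simplified calls can be modelled with a small in-memory store to show why part numbers matter: parts may arrive in any order, and the object is assembled by part number on completion. This is a sketch only; `MultipartStore` and `InMemoryStore` are hypothetical names, nothing here talks to S3, and `complete` takes the `uploadId` as an assumption (the real completion request also needs the part numbers and the ETags returned for each part).

```scala
import java.security.MessageDigest
import scala.collection.mutable

object MultipartModel {
  // Hypothetical interface mirroring the simplified methods on the slide
  trait MultipartStore {
    def initialize(bucket: String, filename: String): String                          // => uploadId
    def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String  // => hashSum
    def complete(uploadId: String): Array[Byte]                                       // => assembled object
  }

  final class InMemoryStore extends MultipartStore {
    private val uploads = mutable.Map.empty[String, mutable.Map[Int, Array[Byte]]]

    def initialize(bucket: String, filename: String): String = {
      val id = s"$bucket/$filename#${uploads.size + 1}"
      uploads(id) = mutable.Map.empty
      id
    }

    def uploadChunk(uploadId: String, partNumber: Int, content: Array[Byte]): String = {
      uploads(uploadId)(partNumber) = content
      // Hex-encoded MD5 of the part, standing in for the hash the service returns
      MessageDigest.getInstance("MD5").digest(content).map("%02x".format(_)).mkString
    }

    def complete(uploadId: String): Array[Byte] =
      // Concatenate parts in part-number order, regardless of arrival order
      uploads.remove(uploadId).get.toSeq.sortBy(_._1).flatMap(_._2.toSeq).toArray
  }
}
```

For example, uploading part 2 before part 1 and then calling `complete` still yields the bytes of part 1 followed by the bytes of part 2, which is what makes parallel chunk uploads safe.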

12. Let's create an S3 Sink!
● SinkA = Flow to SinkB
● S3 upload sink = S3 upload flow to Sink.head (first value received)

13. S3 upload sink
    Flow[Byte]
      .grouped(chunkSize)                               // split the stream into chunks
      .zip(Source.fromIterator(() => Iterator.from(1))) // give the chunks numbers
      .fold[MultipartUploader](                         // fold over uploader state
        initUploader()                                  // initial value - uploader
      ) { case (uploader, (data, chunkNumber)) =>       // reduce - returns uploader (!)
        uploader.uploadChunk(chunkNumber, data.toArray)
      }
      .map { uploader =>
        uploader.complete()                             // close the uploader on completion
      }
      .toMat(Sink.head)(Keep.right)                     // keep Sink.head's Future; plain .to would discard it
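The chunk-number-fold shape of this sink can be exercised without S3 by swapping in an in-memory uploader. This is a sketch, assuming Akka 2.6+; `FakeUploader` is a hypothetical stand-in for the talk's `MultipartUploader`, and `zipWithIndex` is used here as an alternative way to number the chunks. Real S3 part numbers start at 1, and every part except the last must be at least 5 MB, so a production `chunkSize` would be far larger than in this demo.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object ChunkedSinkDemo {
  // Hypothetical immutable uploader: remembers parts instead of sending them
  final case class FakeUploader(parts: Vector[(Int, Array[Byte])] = Vector.empty) {
    def uploadChunk(partNumber: Int, data: Array[Byte]): FakeUploader =
      copy(parts = parts :+ (partNumber -> data))
    // Returns the total number of bytes "uploaded"
    def complete(): Long = parts.map(_._2.length.toLong).sum
  }

  def uploaderSink(chunkSize: Int): Sink[Byte, Future[Long]] =
    Flow[Byte]
      .grouped(chunkSize)                       // split the byte stream into chunks
      .zipWithIndex                             // pair each chunk with a 0-based index
      .fold(FakeUploader()) { case (uploader, (data, index)) =>
        uploader.uploadChunk(index.toInt + 1, data.toArray) // 1-based part numbers
      }
      .map(_.complete())                        // single element emitted after the last chunk
      .toMat(Sink.head)(Keep.right)             // expose that element as a Future

  def upload(payload: Array[Byte], chunkSize: Int): Long = {
    implicit val system: ActorSystem = ActorSystem("sink-demo")
    try Await.result(Source(payload.toIndexedSeq).runWith(uploaderSink(chunkSize)), 10.seconds)
    finally Await.result(system.terminate(), 10.seconds)
  }
}
```

A `fold` processes chunks one after another; uploading parts concurrently would need an asynchronous stage such as `mapAsync` in place of the fold, which is outside the scope of this sketch.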

14. SQL Source
Anorm provides an akka-stream SQL source
    libraryDependencies ++= Seq(
      "com.typesafe.play" %% "anorm-akka" % "version",
      "com.typesafe.akka" %% "akka-stream" % "version")
    AkkaStream.source(SQL"SELECT * FROM Test", SqlParser.scalar[String], ColumnAliaser.empty): Source[String]
Brings minimal transitive dependencies (!)

15. Road to production
● Retries in case of S3 errors/failures
● Handle possible problems during stream execution (e.g. a failure talking to the DB)
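The first item, retrying failed S3 calls, could start from a small helper around `Future`. This is a minimal sketch using only the Scala standard library; `retry`, `demo`, and the simulated flaky upload are hypothetical, and it retries immediately on any non-fatal error, whereas a production version would likely add backoff and retry only errors known to be transient.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object Retry {
  // Re-runs `attempt` until it succeeds or `maxAttempts` calls have been made
  def retry[T](maxAttempts: Int)(attempt: () => Future[T])(implicit ec: ExecutionContext): Future[T] =
    attempt().recoverWith {
      case NonFatal(_) if maxAttempts > 1 => retry(maxAttempts - 1)(attempt)
    }

  // Simulated chunk upload that fails twice before succeeding
  def demo(): (String, Int) = {
    import scala.concurrent.ExecutionContext.Implicits.global
    val calls = new AtomicInteger(0)
    val flakyUpload = () =>
      if (calls.incrementAndGet() < 3) Future.failed[String](new RuntimeException("simulated 503"))
      else Future.successful("etag-123")
    val result = Await.result(retry(3)(flakyUpload), 5.seconds)
    (result, calls.get())
  }
}
```

Retrying at the level of a single chunk upload keeps already uploaded parts intact, which fits the multipart API: only the failed part is sent again.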

16. 200 OK
