
Building Serverless Data Pipelines - Serverless Toronto User Group



Data pipelines are the foundation of any cutting-edge tech product. It does not matter whether you are building a web app, a mobile app, a video game or an AI recommender system: it needs data to run. As an example, let's say you're building an AI recommender system that tells you which stock to invest in. The AI providing the suggestions needs data on which it can be trained. If you have heard of MLOps, it is basically a fancy new buzzword for a data pipeline that feeds data to an ML system.

Presenter Bio: Ivan Vukasinovic is the founder of Devinity Corp and ex-Amazonian.

Sponsors: Want to thank and for making this Serverless Toronto event possible & for donating NOT 1 BUT 2 eBooks this month!

More info about the event:

Join the community to learn how to leverage Serverless Architectures to innovate faster and make your apps scalable & cost-effective!

Published in: Software

Building Serverless Data Pipelines - Serverless Toronto User Group

  1. 1. Dec 13, 2018 Meetup
  2. 2. Thursday, Dec 13, 2018 Meetup Agenda: 1. Intro & Announcements 2. Stand Up Introductions 3. Presenter Ivan Vukasinovic: Building Serverless Data Pipelines 4. Networking
  3. 3. 2 Manning Publications giveaways today: 1. Serverless Applications with Node.js - Slobodan Stojanović & Aleksandar Simović 2. Production-Ready Serverless - Yan Cui 3. Serverless Applications with AWS - Marcia Villalba 4. Serverless Architectures on AWS, 2nd Ed - Peter Sbarski 5. Voice Applications for Alexa and Google Assistant - Dustin Coates 6. The Quick Python Book, Third Edition - Naomi Ceder 7. Google Cloud Platform in Action - JJ Geewax [PRINT]
  4. 4. About our catering sponsor: © 2018 Cloudinary Inc., Confidential Information. Do Not Distribute. Challenges for Web/Mobile Developers ● Handling multiple image types across different browsers: JPEG-XR, WebP, etc. ● Page load times and impact on SEO, conversion rates and other business metrics ● Automating the management of user generated content ● Optimizing images and videos for faster loading ● Adapting media for responsive delivery across different viewports
  5. 5. The Media Full Stack: Image & Video Management Platform (diagram labels) Upload Storage Administration Manipulation Delivery AUTOMATIC OPTIMIZATION Format Crop DPR Quality Resize Encryption Multi-Region Backup Revision History Disaster Recovery API Widget Remote Fetch Web Interface Overlays Effects & Filters Face Detection Responsive OCR API Analytics Access Control Search Web Console DAM Multi-CDN Your Own CDN
  6. 6. The Cloudinary Image & Video Management Platform (diagram labels) DAM CMS 100% Open APIs Workflow 3RD Party PIM AI, ML Wide range of SDKs
  7. 7. 5 Second Introduction
  8. 8. Serverless Data Pipelines Development Infinity Ivan Vukasinovic Founder @ Devinity
  9. 9. Who is this guy? Devinity??? ● Living cliché of “Wrote his first code before he could read” ● 10+ years of getting paid to build software ● Serial entrepreneurial failure, 3 times and counting ● See photos --> ● Devinity started as a group of techies that turned into a community and finally a company ● We build our own products as well, so we know how to start up ● Stealth Mode, no photos :)
  10. 10. What’s this about? ● A love story between Serverless, Data Pipelines and Startups ● Serverless: Architecture & Framework ● Data pipelines (ETLs): Complex systems that process and store all those petabytes of data ● Startups: 21st century version of inventors, world shakers, boundary breakers, etc...
  11. 11. Data Pipeline ● In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. ● Short version: It’s a bunch of code that retrieves, processes and stores data, nicely. ● Hopefully without leaking too much
  12. 12. Data Pipeline Metrics ● When it comes to building a data pipeline, the most important things are data processing speed and the ability to scale. ● The speed of a data pipeline is measured with a metric called time to live (TTL). ● Time to live is the time that elapses from the moment data changes in an external source until the data pipeline has finished processing that change. ● To make your system as close as possible to real time, your time to live needs to be as short as possible. ● Make sure your pipeline is cost effective; there has to be some money left for a pretty UI.
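The time-to-live metric from this slide can be sketched in a few lines of Python. This is only an illustration, assuming each record carries two timestamps; the names `changed_at` and `processed_at` and the sample times are hypothetical, not from the talk.

```python
from datetime import datetime, timezone


def time_to_live_seconds(changed_at: datetime, processed_at: datetime) -> float:
    """Seconds between a change at the source and the end of processing."""
    if processed_at < changed_at:
        raise ValueError("processed_at precedes changed_at")
    return (processed_at - changed_at).total_seconds()


# Hypothetical sample: the source changed at 18:00:00 UTC and the pipeline
# finished processing that change 42 seconds later.
changed = datetime(2018, 12, 13, 18, 0, 0, tzinfo=timezone.utc)
processed = datetime(2018, 12, 13, 18, 0, 42, tzinfo=timezone.utc)
ttl = time_to_live_seconds(changed, processed)
```

Tracking this number per record (or as a percentile over a time window) is one way to see whether the pipeline is drifting away from real time.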
  13. 13. Data Pipeline Mechanics ● Make sure it doesn’t leak too much ● It would also be nice if you don’t spend most of your time checking for leaks ● Basically don’t build a fragile beauty that needs constant attention.
  14. 14. Serverless ● Is it a bird, is it a plane? It's both :D ● Serverless architecture: a software design pattern where applications are hosted by a third-party service, eliminating the need for server software and hardware management by the developer. ● The Serverless Framework: a free and open-source web framework written using Node.js. Serverless is the first framework that was originally developed for building applications exclusively on AWS Lambda, a serverless computing platform provided by Amazon as a part of Amazon Web Services.
  15. 15. Serverless Data Pipelines ● Building a Data pipeline using the serverless framework and following the serverless architecture design. ● Basically a marriage between Serverless and Data Pipelines written down in a yaml config file. ● Desired goal is to have a completely event driven, streaming, near realtime data pipeline without a single server running 24x7.
  16. 16. Data Pipelines Serverless Benefits ● Time to live is drastically cut down by using the event driven nature of the Serverless architecture. ● Using Serverless gives you the ability to scale up and down in order to handle the volume of incoming data that needs to be processed. ● Scaling is again achieved through the event driven nature of serverless, as you are reacting to events triggered by incoming data. ● The more data that comes in, the more events happen, which results in more parallel instances of Serverless processes (Lambdas) to process the data.
  17. 17. Serverfull Data Pipeline ● Serverfull – every process/code runs on a server ● Scheduled – processes run on a periodic schedule, every 5 mins, every 15 mins, etc.
  18. 18. Serverless Data Pipeline ● Serverless – No servers allowed. ● Event driven – Every process is invoked in reaction to an event that happens
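To make the event-driven idea concrete, here is a minimal sketch of a Python Lambda handler that reacts to S3 "object created" notifications. The bucket name, object key and the `process_object` helper are hypothetical placeholders; only the event shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with URL-encoded keys) follows the S3 notification format.

```python
from urllib.parse import unquote_plus


def process_object(bucket: str, key: str) -> str:
    # Hypothetical stand-in for the real work (download, normalize, store).
    return f"processed s3://{bucket}/{key}"


def handler(event, context):
    """Invoked once per incoming event batch; there is no polling loop."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        results.append(process_object(bucket, key))
    return results


# Local smoke test with a hand-written event (no AWS account needed).
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "dealer-drop"},
                "object": {"key": "inventory/2018-12-13+feed.csv"},
            }
        }
    ]
}
out = handler(sample_event, None)
```

Because each uploaded file produces its own event, more incoming files simply mean more concurrent invocations of the same handler, which is the scaling behaviour the slide describes.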
  19. 19. Serverless VS Serverfull
  20. 20. Serverless VS Serverfull Serverless: ● Reacting to events lowers TTL ● Cost effective: you only pay for the amount of processing that you actually do ● Scales: the number of Serverless processes is correlated to the amount of data coming in ● With the Serverless framework you have the code and the infrastructure in one place, which makes operational efforts easier Serverfull: ● Scheduling increases TTL ● You have to pay for the servers regardless of how much processing they actually do ● Developer intervention is required to increase the number of processes when more data comes in ● The code and the infrastructure are separate parts, which makes operational efforts difficult
  21. 21. Talk Less Build More ● All this talk is nice, but let's see a real example ● Data pipelines are growing more and more complex ● Main data is something considered to be the base of the product ● Integrated data is something we correlate with the main data to provide more value in our product ● Analytics is getting to know our data better ● Expand your main data sources, get more data
  22. 22. Cars Search Platform ● Start simple: SFTP drop site for dealerships to provide data, normalize and store the data ● Make it fast and make it cost effective. ● Design smart! Make it modular so that you don’t have to rewrite all of it in a few months.
  23. 23. Cars Search Platform ● It's a search platform, so let's index some data ● The path from the source to where the users can see the data is marked as the Critical Data Path ● Make it short and simple in order for it to be quick and painless
  24. 24. Cars Search Platform ● Let's integrate some external data!!! What data? So many options! ● Make it easy for users to finish what they want without the need to visit other websites ● User: “I am not buying a used car without Carproof!!!”
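The cost argument on this slide can be sketched as simple arithmetic. Every number below (server count, hourly rate, invocation count, per-second rate) is a made-up assumption for illustration, not a real price list or a figure from the talk.

```python
def always_on_cost(servers: int, hourly_rate: float, hours: float) -> float:
    """Serverfull: you pay for every hour, busy or idle."""
    return servers * hourly_rate * hours


def pay_per_use_cost(invocations: int, seconds_each: float, rate_per_second: float) -> float:
    """Serverless: you pay only for the compute time actually consumed."""
    return invocations * seconds_each * rate_per_second


# Hypothetical month: two small servers running 24x7 versus 100,000
# half-second invocations billed per second of execution.
fixed = always_on_cost(servers=2, hourly_rate=0.10, hours=720)
usage = pay_per_use_cost(invocations=100_000, seconds_each=0.5, rate_per_second=0.00002)
print(f"always-on: {fixed:.2f}  pay-per-use: {usage:.2f}")
```

With a low or bursty workload the pay-per-use figure stays small, while a sustained heavy workload can flip the comparison, so it is worth plugging in your own volumes.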
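As a rough sketch of the "normalize and store" step for a dealership feed, the snippet below parses a CSV drop and maps each row to one common shape. The column names (`VIN`, `Make`, `Model`, `Year`, `Price`) and the sample row are assumptions for illustration; real dealer feeds will differ.

```python
import csv
import io


def normalize_row(row: dict) -> dict:
    """Map one raw feed row to the platform's (hypothetical) common schema."""
    return {
        "vin": row["VIN"].strip().upper(),
        "make": row["Make"].strip().title(),
        "model": row["Model"].strip().title(),
        "year": int(row["Year"]),
        # Strip currency formatting such as "$4,500" before converting.
        "price": float(row["Price"].replace("$", "").replace(",", "")),
    }


def normalize_feed(csv_text: str) -> list:
    """Parse a whole CSV drop and normalize every row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [normalize_row(row) for row in reader]


feed = 'VIN,Make,Model,Year,Price\n 1hgcm82633a004352 ,honda,accord,2003,"$4,500"\n'
cars = normalize_feed(feed)
```

Keeping normalization in one small, pure function is one way to get the modularity the slide asks for: a new dealer format only needs a new mapping, not a rewrite.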
  25. 25. Cars Search Platform ● When it comes to data, Be Greedy!!! ● More data means more potential value for the users, your PO/PM/CEO will have an idea what to do with the data :D ● Crawling the Internet is always a way to get more data
  26. 26. Cars Search Platform ● At this point you have been dating for a couple of months, so it is time to get to know your Data, right? Going through the medicine cabinet, family med. history, etc... ● Experiment a lot: average price for a model for that year, Good Deal, Bad Deal, Likely to get you *****, Below Average Value, etc...
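One of the experiments named on this slide, the average price for a model and year plus a Good Deal / Bad Deal label, might look like the sketch below. The 10% tolerance, the "Fair Deal" fallback label and the sample listings are assumptions added for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_prices(listings):
    """Average listing price per (model, year) pair."""
    groups = defaultdict(list)
    for car in listings:
        groups[(car["model"], car["year"])].append(car["price"])
    return {key: mean(prices) for key, prices in groups.items()}


def label_deal(car, averages, tolerance=0.10):
    """Label a listing relative to its group's average (tolerance is assumed)."""
    avg = averages[(car["model"], car["year"])]
    if car["price"] <= avg * (1 - tolerance):
        return "Good Deal"
    if car["price"] >= avg * (1 + tolerance):
        return "Bad Deal"
    return "Fair Deal"  # hypothetical middle label, not from the talk


listings = [
    {"model": "Accord", "year": 2015, "price": 10000},
    {"model": "Accord", "year": 2015, "price": 12000},
    {"model": "Accord", "year": 2015, "price": 14000},
]
averages = average_prices(listings)
labels = [label_deal(car, averages) for car in listings]
```

In a real pipeline the averages would be recomputed as new inventory arrives, so the labels stay in step with the market.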
  27. 27. Cars Search Platform
  28. 28. Cars Search Platform ● Architecture is done, so building it is gonna be easy, right? ● Architecture VS Development ● Serverless empowers Infrastructure as Code ● Perpetual Beta ● Development Infinity
  29. 29. Startups, Data Pipelines and Serverless ● Decreases Operational Load on developers by scaling with the load: “Ahh, we got another million users last night, somebody rush to Costco, we need more servers.” “A client uploaded his whole inventory and broke our pipeline, what a dumba$$.” “It's Christmas time, we need more servers!!! Can we salvage some? Can we scale without more?” ● Makes Infrastructure Less Chaotic by having Infrastructure in the Code: “Who created this Queue?” “How do you deploy this?” “What are all these alarms for?” ● Reduces the amount of throw-away code: “So you are telling me all our code runs on one single server and we cannot split it? Khm RoR...” “Wait, so we gotta rewrite this whole thing, again?” ● Makes your infrastructure cost effective: “So why do we have 30 servers running at 0% CPU usage?” “Reserved instances are 40% cheaper than on demand? Not if they are sitting around playing Go Fish.” “So why is our infrastructure bill bigger than our payroll?”
  30. 30. That's All Folks ● Thanks for watching, and make sure to wake up the person next to you