Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra

4,850 views

Published on

Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra
----------------------------------

Presentation: @TIAD Paris -- http://tiad.io/

----------------------------------

Source code: https://github.com/rogaha/data-processing-pipeline

----------------------------------

Resources: https://www.docker.com/products/docker / https://www.docker.com/technologies/overview

Published in: Technology
  • Be the first to comment

Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra

  1. 1. Real-Time Data Processing Pipeline & Visualization with Docker, Spark, Kafka and Cassandra Roberto G. Hashioka – 2016-10-04 – TIAD – Paris
  2. 2. Personal Information • Roberto Gandolfo Hashioka • @rogaha (Github) e @rhashioka (Twitter) • Finance -> Software Engineer • Growth & Data Engineer at Docker
  3. 3. Summary • Background / Motivation • Project Goals • How to build it? • DEMO
  4. 4. Background • Gather of data from multiple sources and process them in “real-time” • Transform raw data into meaningful and useful information used to enable more effective decision-making process • Provide more visibility into trends on: 1) user behavior 2) feature engagement 3) opportunities for future investments • Data transparency and standardization
  5. 5. Project Goals • Create a data processing pipeline that can handle a huge amount of events per second • Automate the development environment — Docker compose. • Automate the remote machines management — Docker for AWS / Machine. • Reduce the time to market / time to development — New hires / new features.
  6. 6. Project / Language Stack
  7. 7. How to build it? • Step 1: Install Docker for Mac/Win and dockerize all the applications link: https://www.docker.com/products/docker
  8. 8. Exemplo de Dockerfile ----------------------------------------------------------------------------------------------------------- FROM ubuntu:14.04 MAINTAINER Roberto Hashioka (roberto@docker.com) RUN apt-get update && apt-get install -y nginx RUN echo “Hello World! #TIAD” > /usr/share/nginx/html/index.html EXPOSE 80 ------------------------------------------------------------------------------------------------------------ $ docker build –t rogaha/web_demotiad2016 . $ docker run –d –p 80:80 –-name web_demotiad2016 rogaha/web_demotiad2016
  9. 9. How to build it? • Step 2: Define your services stack with a docker-compose file
  10. 10. Docker Compose containers: web: build: . command: python app.py ports: - "5000:5000" volumes: - .:/code links: - redis environment: - PYTHONUNBUFFERED=1 redis: image: redis:latest command: redis-server --appendonly yes
  11. 11. How to build it? • Step 3: Test the applications locally from your laptop using containers
  12. 12. How to build it?
  13. 13. How to build it? • Step 4: Provision your remote servers and deploy your containers
  14. 14. How to build it?
  15. 15. How to build it? • Step 5: Scale your services with Docker swarm
  16. 16. DEMO source code: https://github.com/rogaha/data-processing-pipeline
  17. 17. Open Source Projects Used • Docker (https://github.com/docker/docker) • An open platform for distributed applications for developers and sysadmins • Apache Spark / Spark SQL (https://github.com/apache/spark) • A fast, in-memory data processing engine. Spark SQL lets you query structured data as a resilient distributed dataset (RDD) • Apache Kafka (https://github.com/apache/kafka) • A fast and scalable pub-sub messaging service • Apache Zookeeper (https://github.com/apache/zookeeper) • A distributed configuration service, synchronization service, and naming registry for large distributed systems • Apache Cassandra (https://github.com/apache/cassandra) • Scalable, high-available and distributed columnar NoSQL database • D3 (https://github.com/mbostock/d3) • A JavaScript visualization library for HTML and SVG.
  18. 18. Thanks! Questions? @rhashioka

×