Leveraging the power of SolrCloud and Spark with OpenShift

Kubernetes/Cloud-Native-Meetup September 2018, Munich : Talk by Franz Wimmer (@zalintyre, Software Engineer at QAware)

Abstract: One of the most commonly used big data processing frameworks is Apache Spark. Spark manages to process large datasets with parallelization. Solr is a search platform based on Lucene. Solr can be distributed across a cluster using ZooKeeper for configuration management. Both applications can be combined to create performant Big Data applications.
But what if you want to scale out horizontally and add a node? In a manual setup, you would have to install and configure the new node by hand. Cluster orchestrators like OpenShift claim to solve this problem.
This talk shows how to put Spark, Solr and ZooKeeper into containers, which can then be scaled individually inside a cluster using OpenShift. We will cover OpenShift details like DeploymentConfigs, StatefulSets, Services, Routes and Persistent Volumes and install a complete, fail-safe and horizontally scalable SolrCloud / Spark / ZooKeeper cluster in seconds.
You will also learn about the drawbacks and pitfalls of running Big Data applications inside an OpenShift cluster.

  1. Franz Wimmer (franz.wimmer@qaware.de, @zalintyre): Leveraging the power of SolrCloud and Spark with OpenShift, 27.09.2018
  2. Outline: 1. Introduction, 2. Basic Cloud Computing, 3. Used Technologies, 4. Implementation, 5. Security, 6. Live Demo, 7. Lessons learned, 8. Summary
  3. Franz Wimmer, Software Engineer.
     Since 2018: QAware GmbH, Software Engineer.
     2017 / 2018: Master's thesis @ QAware GmbH.
     2016 - 2018: M.Sc. Computer Science @ FH Rosenheim.
     2011 - 2016: Dual B.Sc. Computer Science @ FH Rosenheim.
     Contact: franz.wimmer@qaware.de
  4. Outline: 1. Introduction, 2. Basic Cloud Computing, 3. Used Technologies, 4. Implementation, 5. Security, 6. Live Demo, 7. Lessons learned, 8. Summary
  5. Definitions
  6. Definition: Big Data
     "Big Data refers to the inability of traditional data architectures to efficiently handle the new datasets." (National Institute of Standards and Technology, USA)
     Four characteristics of "Big Data": Volume, Variety, Velocity, Variability.
     Related technologies [War13]: NoSQL, MapReduce, Machine Learning.
  7. Definition: Cloud Computing
     Providing compute power, storage and applications over a network.
     Characteristics [Mel11]: automated management, network access, resource pools, monitoring of resource consumption.
  8. Cloud Computing: Delivery models
     SaaS (Software as a Service): ready-to-use applications. Target audience: users.
     PaaS (Platform as a Service): development tools, databases / APIs. Target audience: developers.
     IaaS (Infrastructure as a Service): hardware, network. Target audience: operations.
  10. Motivation
  11. Size of data vs. compute power (global)
      The size of data keeps growing. Compute power does not scale at the same rate.
      Parallelization. Scale out.
      Diagram: [Kac]
  12. Outline: 1. Introduction, 2. Basic Cloud Computing, 3. Used Technologies, 4. Implementation, 5. Security, 6. Live Demo, 7. Lessons learned, 8. Summary
  13. Cloud Technologies
  14. Docker
      Virtualization software. Puts your applications into containers. Virtualization with Linux resources. Overhead: 0-4% (VM: 17-22%).
      Diagram, container stack: hardware, host operating system, Docker, then libraries and application per container.
      Diagram, virtual machine stack: hardware, hypervisor, then guest OS, libraries and application per virtual machine.
  15. Docker
      Docker packages applications into images. Dockerfiles describe images. An image can be started as a container as often as desired.

          FROM python:2.7-alpine
          COPY ./spark-ui-proxy.py /
          ENV SERVER_PORT=8080
          ENV BIND_ADDR="0.0.0.0"
          EXPOSE 8080
          ENTRYPOINT ["python", "/spark-ui-proxy.py"]

  16. Containers: not enough for a production system
      Image: https://twitter.com/mfdii/status/697532387240996864
  17. Kubernetes
      Cluster orchestrator. Container management. Containers can be grouped into pods.
      Features: scaling and load balancing, overlay network, service discovery, health checks, management of persistent storage.
      Reactive Manifesto: pods can be terminated at any time.
      Image: https://cloudplatform.googleblog.com/2015/01/what-makes-a-container-cluster.html
  18. Kubernetes
      Transfer rates measured with iPerf:
      One pod, two containers: 56.7 GBit/s
      One node, two pods: 42.7 GBit/s
      Two nodes, one pod each: 0.90 GBit/s
  19. OpenShift
      "Enterprise Kubernetes". PaaS platform.
      Features: Templates, Routes.
      Editions: Online, Container Platform, Dedicated, Origin (Open Source).
  20. OpenShift Architecture
      Image: https://docs.openshift.org/latest/architecture/index.html
  21. Big Data applications
  22. Apache Spark
      Framework for distributed computing, for example MapReduce. Supports arbitrary backend storage services (SQL, HDFS, …). Master-slave architecture.
      Image: https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-architecture.html
  23. SolrCloud
      NoSQL database based on Lucene.
      Features: full text search, REST interface, scalable and fault tolerant with shards and replicas.
      SolrCloud: distributed across multiple nodes, manages itself via ZooKeeper.
  24. ZooKeeper
      Distributed configuration management. Ensures consistency in distributed services.
      Images: https://zookeeper.apache.org/doc/current/zookeeperOver.html https://blog.kloia.com/distributed-computing-in-microservices-cap-theorem-253c16017a99
  25. Other cloud technologies in this thesis (an incomplete list)
      Hadoop File System, Cloudera Hue (web UI for Hadoop), ZooNavigator (web UI for ZooKeeper), Zeppelin (web UI for Spark).
  26. Goals
  27. Target architecture
      Image: QAware GmbH
  28. Goals of this thesis
      Proving the technology stack. Automated deployment of Big Data applications. Impediments and pitfalls should be documented.
  29. Outline: 1. Introduction, 2. Basic Cloud Computing, 3. Used Technologies, 4. Implementation, 5. Security, 6. Live Demo, 7. Lessons learned, 8. Summary
  30. OpenShift templates
      Templates: describe applications in the cluster, are written in YAML, are easy to read and write (depends ;) ), and can be parametrized.
      Objects: e.g. controllers like DeploymentConfigs or StatefulSets.

          # OpenShift template for automated deployment of Solr
          apiVersion: v1
          kind: Template
          metadata:
            name: spark-solr.template
          objects:
          - [...]
          - [...]

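
      The "can be parametrized" point above can be illustrated with a minimal sketch. The parameter name `SPARK_IMAGE`, its default value, and the object name and labels are assumptions made for illustration; they are not taken from the talk.

      ```yaml
      # Sketch of a parametrized OpenShift template (names are hypothetical).
      apiVersion: v1
      kind: Template
      metadata:
        name: spark-solr.template
      parameters:
        # Assumed parameter: lets the caller choose the Spark image at deploy time.
        - name: SPARK_IMAGE
          description: Image used for the Spark master container
          value: spark:latest
      objects:
        - apiVersion: v1
          kind: DeploymentConfig
          metadata:
            name: spark-master
          spec:
            replicas: 1
            selector:
              app: spark-master
            template:
              metadata:
                labels:
                  app: spark-master
              spec:
                containers:
                  - name: spark-master
                    # ${SPARK_IMAGE} is substituted when the template is processed.
                    image: ${SPARK_IMAGE}
                    args:
                      - sbin/start-master.sh
      ```

      Assuming the file is saved as `template.yaml`, a template like this would typically be rendered and applied with `oc process -f template.yaml -p SPARK_IMAGE=<image> | oc apply -f -`.
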
  31. Stateless applications: DeploymentConfig

          [...]
          - apiVersion: v1
            kind: DeploymentConfig
            spec:
              replicas: 1
              template:
                spec:
                  containers:
                  - args:
                    - sbin/start-master.sh
                    env:
                    - name: SPARK_MASTER_PORT
                      value: '7077'
                    image: spark:latest
                    ports:
                    - containerPort: 7077
                      name: spark-driver
                    - containerPort: 8080
                      name: spark-mstr-http

  32. Stateful applications: StatefulSets
      YAML like a DeploymentConfig. Additional guarantees: stable hostnames, stable persistent volumes.

          [...]
          - apiVersion: apps/v1beta1
            kind: StatefulSet
            spec:
              replicas: 7
              template:
                spec:
                  containers:
                  - command:
                    - 'bin/solr -f -c -z zookeeper:2181 -s /solrdata'
                    image: solr:latest
                    volumeMounts:
                    - name: solr-pvc
                      mountPath: /solrdata
              volumeClaimTemplates:
              - spec:
                  accessModes: [ "ReadWriteOnce" ]
                  resources:
                    requests:
                      storage: "10Gi"

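
      The stable hostnames of a StatefulSet come from a headless Service that the StatefulSet references through its `serviceName` field. The slide does not show this part, so the following is only a sketch: the Service name `solr-headless` and the label `app: solr` are assumptions, and 8983 is Solr's default HTTP port.

      ```yaml
      # Sketch of a headless Service for the Solr StatefulSet (names are hypothetical).
      apiVersion: v1
      kind: Service
      metadata:
        name: solr-headless
      spec:
        # clusterIP: None makes the Service headless: DNS resolves to the
        # individual pods instead of a single load-balanced virtual IP.
        clusterIP: None
        selector:
          app: solr          # assumed pod label
        ports:
          - name: http
            port: 8983       # Solr default port
      ```

      With `serviceName: solr-headless` set on a StatefulSet named `solr`, the pods would be reachable under names of the form `solr-0.solr-headless`, `solr-1.solr-headless`, and so on within the same namespace.
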
  33. Services
      Diagram: a client connects to zookeeper:2181. The Service (Name: zookeeper, Selector: app=zookeeper, Port: 2181) forwards the request to one of Pod 1, Pod 2 or Pod 3 (each with Label: app=zookeeper, Port: 2181).
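
      Written as a manifest, the Service from the diagram might look like the following sketch. Name, selector and port are taken from the slide; the port name `client` is an assumption.

      ```yaml
      # Sketch of the zookeeper Service shown in the diagram.
      apiVersion: v1
      kind: Service
      metadata:
        name: zookeeper        # clients resolve this as zookeeper:2181
      spec:
        selector:
          app: zookeeper       # matches every pod labeled app=zookeeper
        ports:
          - name: client       # assumed port name
            port: 2181
            targetPort: 2181
      ```

      Because the selector matches on labels rather than pod names, pods can be added or replaced without clients having to change the address they connect to.
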
  34. Routes
      Diagram: a client outside the OpenShift cluster requests http://spark.domain.tld:80. The Route (To: spark, Port: 8080) forwards to the Service (Name: spark, Port: 8080), which forwards to the pods inside the cluster.
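
      A Route for this diagram could be sketched as follows. Host, target Service and port come from the slide; the Route name and the `route.openshift.io/v1` API group are assumptions (older OpenShift 3 clusters also accept `apiVersion: v1` for Routes).

      ```yaml
      # Sketch of the Route exposing the spark Service to external clients.
      apiVersion: route.openshift.io/v1
      kind: Route
      metadata:
        name: spark              # assumed Route name
      spec:
        host: spark.domain.tld   # external hostname from the slide
        to:
          kind: Service
          name: spark
        port:
          targetPort: 8080       # port of the spark Service
      ```
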
  35. Docker images
      For applications in this thesis: partially existing Docker images, partially self-built Docker images.
      In any case, customizing for OpenShift: file permissions, custom actions on container start.

          FROM solr:7.1-alpine
          USER root
          # Patch permissions to solr working dir
          RUN chgrp -R 0 /opt/solr && chmod -R 775 /opt/solr
          ADD entrypoint.sh /opt/solr/
          RUN chmod 775 /opt/solr/entrypoint.sh
          USER $SOLR_USER
          ENTRYPOINT ["/opt/solr/entrypoint.sh"]

  36. Challenges
      Not all applications are cloud native.
      Assumptions that are not true in the (OpenShift) cloud: the container runs with root privileges; the current user has a username; the application's own process ID is greater than 1.
      Not all applications are well documented.
      Applications in a cluster are hard to debug.
  37. Outline: 1. Introduction, 2. Basic Cloud Computing, 3. Used Technologies, 4. Implementation, 5. Security, 6. Live Demo, 7. Lessons learned, 8. Summary
  38. Security: Motivation
      "Western Digital My Cloud drives have a built-in backdoor" (techspot, 05.01.2018)
      "Homeland Security Data Breach Affects 240,000 Federal Employees, Plus Witnesses and Interviewees" (gizmodo, 03.01.2018)
      "Dynamics 365: Microsoft distributes private key to all customers" (golem.de, 08.12.2017)
      "More than 100 gigabytes: Confidential NSA data unprotected in the cloud" (heise online, 29.11.2017)
      "Hacker attack: Uber covered up data theft affecting 57 million users" (computerbase, 22.11.2017)
      "Russian Hackers Stole NSA Data on U.S. Cyber Defense" (The Wall Street Journal, 05.10.2017)
      "Passwords To Access Over A Half Million Car Tracking Devices Just Leaked Online" (gizmodo Australia, 23.09.2017)
      "Data of millions of Verizon customers was unprotected" (heise online, 13.07.2017)
      "GOP Data Firm Accidentally Leaks Personal Details Of Nearly 200 Million American Voters" (gizmodo Australia, 20.06.2017)
      "Telekom cloud customer could view other users' address books" (golem.de, 08.12.2016)
  39. Risks of operating a cloud infrastructure
      Sensitive data can be stolen. Solution: secure access to persistent volumes with SELinux.
      Free network access between nodes, pods and containers. Solution: restrict network traffic with network policies.
      Security holes in applications / containers. Solution: update your Docker images regularly!
      "Malware" containers inside the cluster. Solution: operate your own Docker registry.
      Access without authentication and authorization. Solution: use OpenShift as OAuth provider.
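
      As an illustration of restricting network traffic with network policies, the sketch below allows only Solr pods to reach the ZooKeeper client port. The policy name and the `app: solr` label are assumptions, and the cluster's network plugin must support NetworkPolicy for it to take effect.

      ```yaml
      # Sketch of a NetworkPolicy limiting access to ZooKeeper (names are hypothetical).
      apiVersion: networking.k8s.io/v1
      kind: NetworkPolicy
      metadata:
        name: zookeeper-allow-solr
      spec:
        # The policy applies to all pods labeled app=zookeeper.
        podSelector:
          matchLabels:
            app: zookeeper
        policyTypes:
          - Ingress
        ingress:
          - from:
              - podSelector:
                  matchLabels:
                    app: solr      # assumed label of the Solr pods
            ports:
              - protocol: TCP
                port: 2181         # ZooKeeper client port
      ```

      This sketch covers client traffic only. A multi-node ZooKeeper ensemble also needs its peer and leader-election ports opened between the ZooKeeper pods themselves, and any other client (for example Spark) would need its own `from` entry.
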
  40. Live Demo
  41. Outline: 1. Introduction, 2. Basic Cloud Computing, 3. Used Technologies, 4. Implementation, 5. Security, 6. Live Demo, 7. Lessons learned, 8. Summary
  42. Lessons learned
      These lessons are from a real-world project:
      Do not operate a file system over a network! Unless it is a latency-free SAN. If the ZooKeeper cluster collapses, SolrCloud collapses too. Better: local SSD storage.
      Do not operate Solr inside OpenShift! Unless you have local SSD storage. Solr produces a heavy read / write load.
      Do not operate Solr inside containers! Solr uses memory mapped files for caching. Memory mapped files are not managed by Docker / cgroups / namespaces. This basically allows escaping from a container.
  43. Evaluation & Summary
      The technology stack is running on OpenShift, with some constraints. OpenShift tools are suitable for that. Docker images had to be customized.
      Impediments and pitfalls are documented now, but there are many more. Not all applications are "cloud native".
  44. Sources
      Chapter divider images: pexels.com, CC0 License.
      [Kac] C. Kachris and I. Tomkos. A Roadmap on Optical Interconnects in Data Centre Networks. 2015. doi:10.1109/ICTON.2015.7193535.
      [Mel11] P. Mell, T. Grance et al. The NIST definition of cloud computing - Recommendations of the National Institute of Standards and Technology. NIST Special Publication 800-145, 2011.
      [War13] J. S. Ward and A. Barker. Undefined by data: a survey of big data definitions. arXiv preprint arXiv:1309.5821, 2013.
  45. 27.09.2018, QAware
  46. Franz Wimmer, franz.wimmer@qaware.de, @zalintyre
      xing.com/companies/qawaregmbh, linkedin.com/company/qaware-gmbh, slideshare.net/qaware, twitter.com/qaware, github.com/qaware, youtube.com/qawaregmbh
