90% of the data in the world today has been created in the last two years. The world will be creating 163 zettabytes of data a year by 2025. So how do we want to process this volume of data?
Apache Spark is a popular open-source, distributed, general-purpose cluster-computing framework. But how do you create a computing cluster quickly and efficiently? Should I do all the network configuration and cluster management myself? What should I do with my cluster when I no longer need it? Is my cluster secure?
After covering Apache Spark principles and use cases, you will discover OVH Analytics Data Compute, a fast, secure, and efficient Spark Cluster as a Service that answers all of these questions.
OVH Spark Cluster as a Service Overview
1. 1
OVH Analytics Data Compute
April 2019
Apache Spark Cluster as a Service
Mojtaba Imani
DevOps Cloud and Bigdata @OVH
2. 2
Deep Learning
IoT Platform
Artificial Intelligence
Smart cities
Edge Computing
Autonomous Vehicles
Blockchains
Biochips
Smart robots & drones
BY 2025, data will be 10x the total datasphere produced up to 2016
- IDC Survey (2017) « Data Age 2025 »
The Cloud Transformation: A data odyssey…
“Exponential growth of data, its costs & governance:
IS YOUR BUSINESS READY FOR IT?”
3. 3
90% of all the data in the world has been created in the last 2 years.
Autonomous cars generate 20 TB of data per hour.
5. 5
Apache Spark
• An open-source distributed general-purpose cluster-computing
framework.
• The largest open source community in big data
• Up to 100 times faster than Hadoop MapReduce (in-memory
processing and lazy evaluation)
• The leading platform for large-scale SQL, batch processing, stream
processing, and machine learning
• Easy-to-use API with Java, Scala, Python, R, and SQL
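To make "lazy evaluation" concrete, here is a minimal pure-Python sketch. It is not Spark's API: `LazyDataset` is a hypothetical toy class that only mimics the idea that transformations (`map`, `filter`) are recorded and nothing runs until an action (`collect`) is called. In PySpark the analogous chain would be `rdd.map(...).filter(...).collect()`.

```python
from typing import Callable, Iterable, List, Optional, Tuple


class LazyDataset:
    """Toy stand-in for a Spark dataset: transformations are recorded, not run."""

    def __init__(self, source: Iterable[int],
                 steps: Optional[List[Tuple[str, Callable]]] = None):
        self._source = source
        self._steps = steps or []

    def map(self, fn: Callable) -> "LazyDataset":
        # Transformation: only remember the function, do no work yet.
        return LazyDataset(self._source, self._steps + [("map", fn)])

    def filter(self, pred: Callable) -> "LazyDataset":
        # Transformation: only remember the predicate, do no work yet.
        return LazyDataset(self._source, self._steps + [("filter", pred)])

    def collect(self) -> list:
        # Action: run every recorded step over the source in a single pass.
        out = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def square(x: int) -> int:
    calls.append(x)  # record each invocation to show when work really happens
    return x * x


pipeline = LazyDataset(range(5)).map(square).filter(lambda x: x % 2 == 0)
work_done_before_action = len(calls)  # still 0: nothing has been computed
result = pipeline.collect()           # the action triggers the computation
print(work_done_before_action, result)
```

Deferring the work this way is what lets an engine see the whole chain of steps before running any of them, so it can process the data in one pass instead of materialising every intermediate result.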
8. 8
Problem?
• Time and money
• Computer and Network skills and knowledge
• Maintenance
• Spark version
• Scale up
• Idle times
• Cloud Data Connection
12. 12
Problem?
• Time and money (installation and configuration)
• Computer and Network skills and knowledge
• Maintenance
• Spark version
• Scale up
• Idle times
• Cloud Data Connection
13. 13
ovh-spark-submit
17. 17
OVH Analytics Data Compute
• Same command line and options as the original Spark command line
• Select the Spark version (--version 2.4.0)
• The cluster is only accessible through HTTPS.
• A new, dedicated cluster is created for each request and deleted
once the job finishes.
• You have the option to keep your cluster after the job finishes (--keep-infra).
• Your cluster is isolated from the internet.
• Your cluster machines are created in your own OpenStack project.
• Results and output logs are saved in the Swift storage of your OpenStack project.
• Input and output data can come from any source and be in any format
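As a rough illustration of the first bullets, the sketch below assembles an `ovh-spark-submit` command line in Python without running it. Only the `--version` and `--keep-infra` flags come from the slide. The helper name `build_submit_command`, the placement of those flags before the regular Spark options, and the jar name `spark-examples.jar` are assumptions made for the example; check the full manual for the exact syntax.

```python
import shlex
from typing import List, Optional


def build_submit_command(spark_args: List[str],
                         version: Optional[str] = None,
                         keep_infra: bool = False) -> List[str]:
    """Assemble an ovh-spark-submit argument list (hypothetical helper).

    Assumption: OVH-specific flags go before the usual spark-submit options.
    """
    cmd = ["ovh-spark-submit"]
    if version is not None:
        cmd += ["--version", version]  # Spark version, as shown on the slide
    if keep_infra:
        cmd.append("--keep-infra")     # keep the cluster after the job ends
    return cmd + list(spark_args)


# --class is a standard spark-submit option; the jar name is a placeholder.
command = build_submit_command(
    ["--class", "org.apache.spark.examples.SparkPi", "spark-examples.jar", "1000"],
    version="2.4.0",
    keep_infra=True,
)
print(shlex.join(command))
```

Building the arguments as a list rather than a single string keeps quoting safe if the result is later passed to `subprocess.run`.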
19. 19
Download and use command line: ovh-spark-submit
Full manual: https://docs.ovh.com/gb/en/analytics-data-compute/labs/data-compute/getting-started-with-analytics-data-compute/
20. 20
Customer Feedback
A French TV channel provider:
Analytics Data Compute helps us define new jobs without impacting
our production pipelines.
A French car manufacturer:
With the huge amount of data we have, we needed a solution that
could scale easily. Moreover, we did not need a full Hadoop stack. So
Analytics Data Compute was the perfect candidate.
A French bank:
We had trouble with Spark versions and Spark cluster management. With
Analytics Data Compute, we no longer have to bother with those kinds of
problems.
21. 21
Global cloud hyperscalers concentration
Cloud providers' HQs dictate your data sovereignty options
Freedom Act
Cloud Act
GDPR
22. 22
OVH, A GLOBAL HYPER-SCALE CLOUD PROVIDER
KEY FACTS & FIGURES
1,500,000+
CUSTOMERS*
2200+
EMPLOYEES WORLDWIDE
1.5 BILLION €
OF INVESTMENT OVER 5 YEARS
98%
OF HOSTING ROOMS FREE FROM AIR
CONDITIONING
28 DATACENTERS
350,000 SERVERS
5000+
Partners and communities
*May 2018
23. 23
28 Datacenters
Owned, operated and racked by OVH
manufacturing units.
From sheet metal to automated
operations.
Hillsboro x1
Vint Hill x1
Beauharnois x7
Singapore x1
Sydney x1
Warsaw x1
London x1
Roubaix x7
Gravelines x2
Strasbourg x4
Paris x1
Frankfurt x1
24. 24
We operate our own global network and server manufacturing
Keeping control on security, scalability & Quality of Service
• PUE: 1.09
• 1M servers produced (Oct. 2018)
• 33 points of presence
• 260K public cloud instances
• 16 Tbps total network capacity
• 350K physical servers
• Patented water cooling and manufacturing
• Owned and operated data centers
• Proprietary network & anti-DDoS security
« No ingress/egress charges on your network usage. »
25. 25
OVH Data Convergence Team
• Analytics Data Platform: A one-click pre-configured Hadoop stack designed to store and
process high volumes of data across OVH Public Cloud infrastructure.
• Analytics Data Compute: A one-time Apache Spark Cluster on the fly
• Analytics Data Collector: A cloud-hosted agent to replicate, query and transport data