© 2017 IBM Corporation
Deep dive into high performance analytics with
Apache Spark and Object Storage
Effi Ofer
effio@il.ibm.com
In collaboration with Elliot Kolodner, Gil Vernik, Kalman Meth, Maya Anderson, and Michael
Factor from IBM Research, as well as Francesco Pace and Pietro Michiardi from Eurecom
Mar 2017
A brave new object store world
Data - the natural resource of the 21st century
With the rise of cloud, mobility, IoT, social and analytics, the data explosion is accelerating. This confluence of technologies has amplified the data explosion, creating incredible growth-on-growth for unstructured data. New data sources are added daily, resulting in a valuable data ecosystem for every business.
[Chart: projected exabytes of data, 2009-2020³]
90% of all data was created in the last 2 years¹
75 billion Internet-connected devices by 2020²
Sources:
1. Science Daily, Big Data, for better or worse: 90% of world's data generated over last two years, 2013
2. Business Insider, Morgan Stanley: 75 Billion Devices Will Be Connected to The Internet of Things By 2020, 2013
3. Digital Universe of Opportunities: Rich Data & The Increasing Value of the Internet of Things, EMC Digital Universe with Research & Analysis by IDC, April 2014
Extracting insights from data is critical
Consider historic traffic sensor data:
– Are current traffic conditions anomalous?
– What are “normal” traffic conditions at morning rush hour?
– How has the traffic flow changed over time?
– How will the flow change if a traffic lane is removed?
Agenda
– Introduction to storing data: Object Store and IBM COS
– Introduction to analyzing data: Apache Spark
– Deep dive into Spark and storage
Object Storage
▪ High capacity, low cost storage
▪ Examples:
– IBM Cloud Object Storage (IBM COS)
– Azure Blob Storage
– OpenStack Swift
– Amazon S3
▪ Objects stored in flat address space
▪ An object encapsulates data and metadata
▪ Whole object creation; no update in-place
▪ Accessed through RESTful HTTP
– Ideal for mobile, natural for cloud
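To make the RESTful access model concrete, here is a minimal Scala sketch of writing and reading an object over plain HTTP; the endpoint, bucket and object names are placeholders, and real object stores additionally require authenticated, signed requests:

import java.net.{HttpURLConnection, URL}

// PUT creates the whole object in one shot; there is no in-place update.
def putObject(endpoint: String, bucket: String, key: String, data: Array[Byte]): Int = {
  val conn = new URL(s"$endpoint/$bucket/$key").openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("PUT")
  conn.setDoOutput(true)
  conn.getOutputStream.write(data)
  conn.getResponseCode // 200/201 on success
}

// GET returns the object's data; its metadata arrives as response headers.
def getObject(endpoint: String, bucket: String, key: String): String = {
  val conn = new URL(s"$endpoint/$bucket/$key").openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("GET")
  try scala.io.Source.fromInputStream(conn.getInputStream).mkString
  finally conn.disconnect()
}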
IBM Cloud Object Storage:
designed for cloud-scale data
Address unstructured data storage from petabyte to exabyte
with reliability, security, availability and disaster recovery
Making IBM COS extremely durable, even at large scale
– Shared nothing architecture, with strong consistency for data
– Scalable namespace mapping with no centralized metadata
– Highly reliable and available without replication using erasure coding
– Distributed Rebuilder
– Distributed collection and storage of statistics needed for management
– APIs for integration with external management applications
– Support “lazy” handling of disk failures
– Automated network installation
Traditional Storage makes Big Data even bigger
Multiple copies, more storage, more overhead
With traditional storage, a single 1 TB object is replicated three times. Three complete copies of the object, plus overhead, are distributed and maintained in separate locations (1.2 TB each in D.C., Dallas and San Jose) in case of failure or disaster, resulting in 3.6 TB of total storage consumed. Traditional storage therefore requires 3.6 TB of raw storage capacity for 1 TB of usable data.
IBM Cloud Object Storage is built for cloud-scale data
Just as reliable, less complex, more cost-efficient than traditional storage
With IBM Cloud Object Storage there is no need to store replicated data in different systems. A single TB of object storage is encrypted and sliced but never replicated; the slices are distributed geographically (0.56 TB each in D.C., Dallas and San Jose) for durability and availability, consuming only 1.7 TB of raw storage. You can lose some number of slices due to failure or disaster and still quickly recover 100% of your data. IBM Cloud Object Storage thus requires only 1.7 TB of raw storage capacity for 1 TB of usable storage: less than half the storage and 70% lower TCO.
IBM Cloud Object Storage delivers industry
leading flexibility, scalability and simplicity
On-Premise
• Single tenant
• Design specific to needs
• Total control of system
• Local to on-premise compute
Dedicated
• No datacenter space required
• Single tenant
• Flexible configuration options
• OPEX vs CAPEX
Hybrid
Same as on-premise, plus the following:
• Economic benefits of more dispersed sites (i.e., 3 rather than 2)
• On-premise storage replicated to the cloud
• Ability to add capacity to an on-premise deployment when there is no more data center space available
Public (Regional or Cross Regional)
IBM managed options provide full management, monthly billing
• Usage-based pricing
• Elastic capacity
• No datacenter space required
• Fully managed
• Data local to in-cloud compute
• Immediate worldwide footprint
• OPEX vs CAPEX
IBM Cloud Object Storage on Bluemix
Cross Regional Support
IBM Cloud Object Storage is available on Bluemix. Slices are distributed geographically across D.C., Dallas and San Jose for durability and availability: 1.7 TB of raw storage (0.56 TB per site) provides 1 TB of usable data.
Agenda
– Introduction to storing data: Object Store and IBM COS
– Introduction to analyzing data: Apache Spark
– Deep dive into Spark and storage
Driving value from data: data analytics with Apache Spark
Spark is an open source in-memory application framework for distributed data processing and iterative analysis on massive data volumes.
[Diagram: a Spark driver dispatching tasks to multiple executors, each running tasks in parallel]
Spark enables analytics over many different data sources
– Spark Core: general compute engine; handles distributed task dispatching, scheduling and basic I/O functions
– Spark SQL: executes SQL statements
– Spark Streaming: performs streaming analytics using micro-batches
– MLlib (machine learning): common machine learning and statistical algorithms
– GraphX (graph): distributed graph processing framework
A large variety of data sources and formats can be supported, both on-premise and in the cloud: BigInsights (HDFS), Cloudant, dashDB, Object Storage, SQL DB and many others, whether IBM cloud, other clouds, IBM on-prem or other on-prem systems.
Agenda
– Introduction to storing data: Object Store and IBM COS
– Introduction to analyzing data: Apache Spark
– Deep dive into Spark and storage
Connecting Spark to Storage
▪ Spark interacts with its storage system through the Hadoop Filesystem interface
▪ A connector is implemented for each storage system, such as
– HDFS (Hadoop Distributed File System)
– Object storage such as IBM COS, S3, OpenStack Swift
▪ Example of reading and writing from HDFS
data = sc.textFile("hdfs://vault1/inputObjectName")
data.saveAsTextFile("hdfs://vault1/outputObjectName")
▪ Example of reading and writing from IBM Cloud Object Storage
data = sc.textFile("s3d://vault1.service/inputObjectName")
data.saveAsTextFile("s3d://vault1.service/outputObjectName")
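For such URIs to resolve, the connector must be configured on Spark's Hadoop configuration. A minimal sketch for the s3a connector (the fs.s3a.* keys are standard Hadoop properties; the endpoint and credentials are placeholders, and Stocator's s3d scheme uses its own property names):

val hconf = sc.hadoopConfiguration
hconf.set("fs.s3a.endpoint", "https://cos.example.com") // placeholder endpoint
hconf.set("fs.s3a.access.key", "ACCESS_KEY")            // placeholder credentials
hconf.set("fs.s3a.secret.key", "SECRET_KEY")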
Using HDFS with co-located storage and compute has its pain points...
▪ Need to scale compute with storage
– Poor match to explosive data growth
▪ Cloud and on-prem experiences differ
– Cloud services run on object storage
...which object storage as the persistent storage layer addresses
▪ Scale storage independently
– Directly use data stored for other reasons
▪ Same cloud and on-prem experiences
[Diagram: traditional deployment co-locates Spark and HDFS on every node; with IBM Cloud Object Storage, Spark nodes share a single COS layer]
How Spark writes to HDFS
The Spark driver and executors recursively create the directories for the task temporary, job temporary and final output (steps 1-2).
Each task writes its task temporary file (step 3).
At task commit, the executor lists the task temporary directory and renames the file it finds to its job temporary name (steps 4-5).
At job commit, the driver recursively lists the job temporary directories and renames the files it finds to their final names (steps 6-7).
Finally, the driver writes the _SUCCESS object (step 8).
1. Spark driver: make directories recursively:
hdfs://res/data.txt/_temporary/0
2. Spark executor: make directories recursively:
hdfs://res/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_1
3. Spark executor: write task temporary object:
hdfs://res/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_1/part-00000
4. Spark executor: list directory:
hdfs://res/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_1
5. Spark executor: rename task temporary object to job temporary object:
hdfs://res/data.txt/_temporary/0/task_201702221313_0000_m_000000/part-00000
6. Spark driver: list job temporary directories recursively:
hdfs://res/data.txt/_temporary/0/task_201702221313_0000_m_000000
7. Spark driver: rename job temporary object to final name:
hdfs://res/data.txt/part-00000
8. Spark driver: write _SUCCESS object:
hdfs://res/data.txt/_SUCCESS
Simple code that writes a single file to storage:
val data = Array(1)                  // a single element
val distData = sc.parallelize(data)  // distribute it as an RDD
val finalData = distData.coalesce(1) // force a single partition, hence a single task
finalData.saveAsTextFile("hdfs://vault1/data.txt")
How Spark writes to Object Storage
File operations are translated to RESTful calls:
• HEAD
• GET
• PUT
• COPY
• DELETE
1. Spark driver: make directories recursively:
s3a://vault1/data.txt/_temporary/0
2. Spark executor: make directories recursively:
s3a://vault1/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_1
3. Spark executor: write task temporary object:
s3a://vault1/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_1/part-00000
4. Spark executor: list directory:
s3a://vault1/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_1
5. Spark executor: rename task temporary object to job temporary object:
s3a://vault1/data.txt/_temporary/0/task_201702221313_0000_m_000000/part-00000
6. Spark driver: list job temporary directories recursively:
s3a://vault1/data.txt/_temporary/0/task_201702221313_0000_m_000000
7. Spark driver: rename job temporary object to final name:
s3a://vault1/data.txt/part-00000
8. Spark driver: write _SUCCESS object:
s3a://vault1/data.txt/_SUCCESS
Simple code that writes a single file to storage:
val data = Array(1)
val distData = sc.parallelize(data)
val finalData = distData.coalesce(1)
finalData.saveAsTextFile("s3a://vault1/data.txt")
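Note that object stores have no native rename: each rename above is emulated by the connector, conceptually as a server-side COPY to the new key followed by a DELETE of the old one. A sketch of the idea (copy and delete stand in for whatever REST client the connector uses):

// Rename emulation over a RESTful object API: COPY, then DELETE.
def rename(copy: (String, String) => Unit, delete: String => Unit,
           srcKey: String, dstKey: String): Unit = {
  copy(srcKey, dstKey) // server-side COPY of the whole object to the new key
  delete(srcKey)       // DELETE the original object
}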
So what is going on here? What is written to Storage?
▪ Output to storage goes through the Hadoop FileOutputCommitter
▪ Each task execution
– Attempts to write its own task temporary file
– At task commit, renames the task temporary file to a job temporary file
– Task commit is done by the executors, so it occurs in parallel
▪ When all of the tasks of a job complete, the driver
– Calls the output committer to do job commit
– Renames the job temporary files to their final names
– Job commit occurs in the driver after all of the tasks have committed, so it does not benefit from parallelism
Why such complexity?
Avoids incomplete results being misinterpreted as complete
Can it be simplified?
Output committer version 2
– Task temporary files are renamed to their final names at task commit
– Job commit is largely reduced to the writing of the _SUCCESS object
– However, as of Hadoop 2.7.3, this algorithm is not yet the default
https://issues.apache.org/jira/browse/MAPREDUCE-6336
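Enabling it is a one-line Hadoop configuration change; a sketch on a Spark context (the property name is the standard MapReduce key):

// Switch the FileOutputCommitter to algorithm version 2:
// task commit renames directly to final names, shrinking job commit.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")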
Can it be improved even further?
Introducing Stocator
▪ A fast object store connector for Apache Spark that takes advantage of object store semantics
▪ Available at https://github.com/SparkTC/stocator
▪ Marks output as made by Stocator
– The driver is responsible for creating a ‘directory’ to hold the output dataset
– Stocator uses this ‘directory’ as a marker that it wrote the output
– This ‘directory’ is a zero-byte object with the name of the dataset
▪ Avoids renames
– When asked to create a temporary object, Stocator recognizes the name pattern and writes the object directly under its final name (see the sketch below)
– For example
<dataset-name>/_temporary/0/_temporary/attempt_<job-timestamp>_0000_m_000000_<attempt-number>/part-<part-number>
– becomes
<dataset-name>/part-<part-number>_attempt_<job-timestamp>_0000_m_000000_<attempt-number>
▪ When all tasks are done, the driver writes the _SUCCESS object
Stocator plugs in beneath the Hadoop Filesystem interface.
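A sketch of that rewrite (the regular expression is illustrative, not Stocator's actual code):

// Map a FileOutputCommitter temporary path directly to its final object name,
// embedding the attempt id in the part name so concurrent attempts cannot collide.
val TempPath =
  """(.*)/_temporary/0/_temporary/(attempt_\d+_\d+_m_\d+_\d+)/(part-\d+)""".r

def finalName(path: String): String = path match {
  case TempPath(dataset, attempt, part) => s"$dataset/${part}_$attempt"
  case other                            => other // not a temporary path; leave unchanged
}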
Writing to Object Storage using Stocator
1. Spark driver: create a ‘directory’:
s3d://vault1.service/data.txt
2. Spark driver: get container:
s3d://vault1.service/data.txt
3. Spark executor: create output object:
s3d://vault1.service/data.txt/part-00000-attempt_201702231115_0001_m_000000_1000
4. Spark driver: write _SUCCESS object:
s3d://vault1.service/data.txt/_SUCCESS
Simple code that writes a single file to storage:
val data = Array(1)
val distData = sc.parallelize(data)
val finalData = distData.coalesce(1)
finalData.saveAsTextFile("s3d://vault1.service/data.txt")
Wordcount using Stocator
First, provide the credentials for the object store where the data resides:
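For example, a configuration sketch (the fs.s3d.* property names follow Stocator's <scheme>.<service> convention but are assumptions here; the endpoint and credentials are placeholders):

val hconf = sc.hadoopConfiguration
hconf.set("fs.s3d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem") // Stocator's Hadoop FileSystem class
hconf.set("fs.s3d.service.endpoint", "https://cos.example.com")       // placeholder endpoint
hconf.set("fs.s3d.service.access.key", "ACCESS_KEY")                  // placeholder credentials
hconf.set("fs.s3d.service.secret.key", "SECRET_KEY")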
Wordcount using Stocator
And now write your code to access and manipulate the data:
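A minimal wordcount sketch along these lines (object names illustrative):

val text = sc.textFile("s3d://vault1.service/input.txt")
val counts = text.flatMap(_.split(" "))  // split each line into words
  .map(word => (word, 1))                // pair each word with a count of 1
  .reduceByKey(_ + _)                    // sum the counts per word
counts.saveAsTextFile("s3d://vault1.service/wordcount-result")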
Wordcount using Stocator
Finally, this is our output:
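Given the Stocator write path shown earlier, the output consists of the zero-byte ‘directory’ marker, one part object per task, and the _SUCCESS object, along these lines (names illustrative):

s3d://vault1.service/wordcount-result
s3d://vault1.service/wordcount-result/part-00000-attempt_<job-timestamp>_0000_m_000000_<attempt-number>
s3d://vault1.service/wordcount-result/_SUCCESS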
Performance evaluations
▪ Spark cluster
– Spark 2.0.1
• 3 x dual Intel Xeon E5-2690 (12 hyper-threaded cores, 2.60 GHz)
• 256 GB RAM, 1 x 10 Gbps, 1 x 1 TB SATA
▪ Object store
– IBM Cloud Object Storage cluster

Workload | #Jobs | #Stages | Input size | Output size
Copy | 1 | 1 | 50 GB | 50 GB
Read | 1 | 1 | 50 GB | --
Teragen | 1 | 1 | -- | 50 GB
Wordcount | 1 | 2 | 50 GB | 1.6 MB
Terasort | 2 | 4 | 50 GB | 50 GB
TPC-DS | 112 | 179 | 50 GB (raw), 15 GB Parquet | --
Comparing Stocator to base Hadoop Swift and s3a
[Chart: runtime in seconds (0-800) for Stocator, Hadoop Swift and s3a on Teragen, Copy, Terasort, Wordcount, Read (50 GB), Read (500 GB) and TPC-DS; Stocator speedups: 18x, 10x, 9x, 2x, 1x, 1x, 1x*]
Stocator is
• much faster for write workloads
• about the same for read workloads
* comparing Stocator to s3a
Comparing Stocator to s3a with non-default features
[Chart: runtime in seconds (0-800) for Stocator, s3a, s3a CV2 and s3a CV2+FU on Teragen, Copy, Terasort, Wordcount, Read (50 GB), Read (500 GB) and TPC-DS; Stocator speedups: 1.5x, 1.3x, 1.3x, 1.1x, 1x, 1x, 1x*]
* comparing Stocator to s3a commit version 2 (CV2) + fast upload (FU)
Comparing REST operations
[Chart: number of RESTful operations (0-45,000) issued by Stocator, Hadoop Swift and s3a on Teragen, Copy, Terasort, Wordcount, Read (50 GB), Read (500 GB) and TPC-DS; Stocator reductions: 33x, 25x, 24x, 25x, 2x, 2x, 2x*]
Stocator has a lower impact on the object store than s3a
• Reduced cost
• Reduced overhead
* comparing Stocator to s3a
Questions?
Backup Slides
Reading a dataset from Stocator
▪ Confirm that the dataset was produced by Stocator
– Using the metadata from the initial ‘directory’
▪ Confirm that the _SUCCESS object exists
▪ List the object parts belonging to the dataset
– Using a GET container RESTful call
▪ Are there multiple objects from different execution attempts?
– Choose the one that has the most data (see the sketch below)
– This is correct given
• the fail-stop assumption (i.e., a Spark server executes correctly until it halts)
• all successful execution attempts write the same output
• there are no in-place updates in an object store
• at least one attempt succeeded (evidenced by the _SUCCESS object)
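A sketch of that selection rule (the PartObject type and names are hypothetical): group the listed part objects by part number and keep the largest attempt in each group.

case class PartObject(partNumber: Int, attemptId: String, length: Long)

// Among multiple execution attempts for the same part, keep the one with the
// most data; under the assumptions above, it is the most complete copy of
// identical output.
def chooseParts(listed: Seq[PartObject]): Seq[PartObject] =
  listed.groupBy(_.partNumber)
    .values
    .map(attempts => attempts.maxBy(_.length))
    .toSeq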
Additional optimizations
▪ Streaming of output
– Normally, the length of an object is a parameter to a PUT operation
– But this means Spark needs to cache the entire object prior to issuing the PUT
– Hadoop swift and s3a by default cache the object in a local file system
– Stocator leverages HTTP chunked transfer encoding (sketched below)
• The object is sent in chunks (64 KB in Stocator)
• No need to know the final object length before issuing the PUT
– Similar to s3a fast upload
• Minimum size of 5 MB per part
▪ Avoid HEAD operation just before GET
– Often the HEAD is used just to confirm that the object exists and determine its size
– However, GET also returns the object metadata
– In many cases Stocator is able to avoid the extra HEAD call before a GET
▪ Cache the results of HEAD operations
– Spark assumes the input dataset is immutable
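As an illustration of the streaming idea, a sketch using HttpURLConnection's chunked streaming mode, which is how chunked transfer encoding removes the need to know the object's length up front (the endpoint and object name are placeholders):

import java.net.{HttpURLConnection, URL}

val conn = new URL("https://cos.example.com/vault1/part-00000")
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("PUT")
conn.setDoOutput(true)
conn.setChunkedStreamingMode(64 * 1024)    // send the body in 64 KB chunks, as Stocator does
val out = conn.getOutputStream
out.write("record 1\n".getBytes("UTF-8"))  // records can be streamed as they are produced;
out.write("record 2\n".getBytes("UTF-8"))  // no need to buffer the whole object first
out.close()
println(conn.getResponseCode)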

More Related Content

What's hot

Data Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudData Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudAlluxio, Inc.
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldRob Gillen
 
Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Alluxio, Inc.
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...Alluxio, Inc.
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyAlluxio, Inc.
 
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Holden Ackerman
 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMAlluxio, Inc.
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAlluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
 
Optimizing Storage for Big Data Workloads
Optimizing Storage for Big Data WorkloadsOptimizing Storage for Big Data Workloads
Optimizing Storage for Big Data WorkloadsAmazon Web Services
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAlluxio, Inc.
 
DEVNET-1166 Open SDN Controller APIs
DEVNET-1166	Open SDN Controller APIsDEVNET-1166	Open SDN Controller APIs
DEVNET-1166 Open SDN Controller APIsCisco DevNet
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data AnalyticsAmazon Web Services
 
Decoupling Compute and Storage for Data Workloads
Decoupling Compute and Storage for Data WorkloadsDecoupling Compute and Storage for Data Workloads
Decoupling Compute and Storage for Data WorkloadsAlluxio, Inc.
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaAlluxio, Inc.
 
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Alluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAlluxio, Inc.
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
 
Cloud Expo NYC 2017: Running Databases in Containers
Cloud Expo NYC 2017: Running Databases in Containers Cloud Expo NYC 2017: Running Databases in Containers
Cloud Expo NYC 2017: Running Databases in Containers Ocean9, Inc.
 

What's hot (20)

Data Orchestration Platform for the Cloud
Data Orchestration Platform for the CloudData Orchestration Platform for the Cloud
Data Orchestration Platform for the Cloud
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3Accelerate Spark Workloads on S3
Accelerate Spark Workloads on S3
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
 
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
Presto & differences between popular SQL engines (Spark, Redshift, and Hive)
 
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBMPowering Data Science and AI with Apache Spark, Alluxio, and IBM
Powering Data Science and AI with Apache Spark, Alluxio, and IBM
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Optimizing Storage for Big Data Workloads
Optimizing Storage for Big Data WorkloadsOptimizing Storage for Big Data Workloads
Optimizing Storage for Big Data Workloads
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
DEVNET-1166 Open SDN Controller APIs
DEVNET-1166	Open SDN Controller APIsDEVNET-1166	Open SDN Controller APIs
DEVNET-1166 Open SDN Controller APIs
 
Data & Analytics - Session 1 - Big Data Analytics
Data & Analytics - Session 1 -  Big Data AnalyticsData & Analytics - Session 1 -  Big Data Analytics
Data & Analytics - Session 1 - Big Data Analytics
 
Decoupling Compute and Storage for Data Workloads
Decoupling Compute and Storage for Data WorkloadsDecoupling Compute and Storage for Data Workloads
Decoupling Compute and Storage for Data Workloads
 
The hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at HelixaThe hidden engineering behind machine learning products at Helixa
The hidden engineering behind machine learning products at Helixa
 
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
Hybrid Data Lake Architecture with Presto & Spark in the cloud accessing on-p...
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Practical Use of a NoSQL Database
Practical Use of a NoSQL DatabasePractical Use of a NoSQL Database
Practical Use of a NoSQL Database
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Cloud Expo NYC 2017: Running Databases in Containers
Cloud Expo NYC 2017: Running Databases in Containers Cloud Expo NYC 2017: Running Databases in Containers
Cloud Expo NYC 2017: Running Databases in Containers
 

Viewers also liked

How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheLeslie Samuel
 
OpenPOWER SC16 Recap: Day 1
OpenPOWER SC16 Recap: Day 1OpenPOWER SC16 Recap: Day 1
OpenPOWER SC16 Recap: Day 1OpenPOWERorg
 
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...Indrajit Poddar
 
Machine Learning and The Big Data Revolution
Machine Learning and The Big Data RevolutionMachine Learning and The Big Data Revolution
Machine Learning and The Big Data RevolutionRob Thomas
 
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...Romeo Kienzler
 
PythonでDeepLearningを始めるよ
PythonでDeepLearningを始めるよPythonでDeepLearningを始めるよ
PythonでDeepLearningを始めるよTanaka Yuichi
 
人工知能技術のエンタープライズシステムへの適用
人工知能技術のエンタープライズシステムへの適用人工知能技術のエンタープライズシステムへの適用
人工知能技術のエンタープライズシステムへの適用Miki Yutani
 
金融業界におけるAPIエコノミー / Fintech meetup / IBM
金融業界におけるAPIエコノミー / Fintech meetup / IBM金融業界におけるAPIエコノミー / Fintech meetup / IBM
金融業界におけるAPIエコノミー / Fintech meetup / IBMRasmus Ekman
 
Cognitive Computing
Cognitive ComputingCognitive Computing
Cognitive ComputingPietro Leo
 
IoTとAIが牽引するエンタープライズシステムの新展開
IoTとAIが牽引するエンタープライズシステムの新展開IoTとAIが牽引するエンタープライズシステムの新展開
IoTとAIが牽引するエンタープライズシステムの新展開Miki Yutani
 
Watsonをささえる ハイパフォーマンスクラウドで はじめるDeep Learning
Watsonをささえる ハイパフォーマンスクラウドで はじめるDeep LearningWatsonをささえる ハイパフォーマンスクラウドで はじめるDeep Learning
Watsonをささえる ハイパフォーマンスクラウドで はじめるDeep LearningAtsumori Sasaki
 
UN decade of Action for Road Safety
UN decade of Action for Road SafetyUN decade of Action for Road Safety
UN decade of Action for Road SafetyPODIS Ltd
 
Making workflow implementation easy with CQRS
Making workflow implementation easy with CQRSMaking workflow implementation easy with CQRS
Making workflow implementation easy with CQRSParticular Software
 
Petteri Paasio: Miten tästä eteenpäin?
Petteri Paasio: Miten tästä eteenpäin?Petteri Paasio: Miten tästä eteenpäin?
Petteri Paasio: Miten tästä eteenpäin?Minna Kivipelto
 
Protection des données personnelles Se mettre en conformité : Pourquoi ? Comm...
Protection des données personnelles Se mettre en conformité : Pourquoi ? Comm...Protection des données personnelles Se mettre en conformité : Pourquoi ? Comm...
Protection des données personnelles Se mettre en conformité : Pourquoi ? Comm...Patrick Bouillaud
 
IRISH & INTERNATIONAL ART AUCTION 10th April 2017
IRISH & INTERNATIONAL ART AUCTION 10th April 2017IRISH & INTERNATIONAL ART AUCTION 10th April 2017
IRISH & INTERNATIONAL ART AUCTION 10th April 2017Morgan O'Driscoll
 
Thomas Woznicki vs. Dennis Erickson Court Filing (September 1995)
Thomas Woznicki vs. Dennis Erickson Court Filing (September 1995)Thomas Woznicki vs. Dennis Erickson Court Filing (September 1995)
Thomas Woznicki vs. Dennis Erickson Court Filing (September 1995)SteveJohnson125
 

Viewers also liked (20)

How to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your NicheHow to Become a Thought Leader in Your Niche
How to Become a Thought Leader in Your Niche
 
OpenPOWER SC16 Recap: Day 1
OpenPOWER SC16 Recap: Day 1OpenPOWER SC16 Recap: Day 1
OpenPOWER SC16 Recap: Day 1
 
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...
Enabling Cognitive Workloads on the Cloud: GPUs with Mesos, Docker and Marath...
 
Machine Learning and The Big Data Revolution
Machine Learning and The Big Data RevolutionMachine Learning and The Big Data Revolution
Machine Learning and The Big Data Revolution
 
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
 
PythonでDeepLearningを始めるよ
PythonでDeepLearningを始めるよPythonでDeepLearningを始めるよ
PythonでDeepLearningを始めるよ
 
人工知能技術のエンタープライズシステムへの適用
人工知能技術のエンタープライズシステムへの適用人工知能技術のエンタープライズシステムへの適用
人工知能技術のエンタープライズシステムへの適用
 
金融業界におけるAPIエコノミー / Fintech meetup / IBM
金融業界におけるAPIエコノミー / Fintech meetup / IBM金融業界におけるAPIエコノミー / Fintech meetup / IBM
金融業界におけるAPIエコノミー / Fintech meetup / IBM
 
Cognitive Computing
Cognitive ComputingCognitive Computing
Cognitive Computing
 
IoTとAIが牽引するエンタープライズシステムの新展開
IoTとAIが牽引するエンタープライズシステムの新展開IoTとAIが牽引するエンタープライズシステムの新展開
IoTとAIが牽引するエンタープライズシステムの新展開
 
Watsonをささえる ハイパフォーマンスクラウドで はじめるDeep Learning
Watsonをささえる ハイパフォーマンスクラウドで はじめるDeep LearningWatsonをささえる ハイパフォーマンスクラウドで はじめるDeep Learning
Watsonをささえる ハイパフォーマンスクラウドで はじめるDeep Learning
 
UN decade of Action for Road Safety
UN decade of Action for Road SafetyUN decade of Action for Road Safety
UN decade of Action for Road Safety
 
Il turismo nello scenario internazionale
Il turismo nello scenario internazionaleIl turismo nello scenario internazionale
Il turismo nello scenario internazionale
 
Making workflow implementation easy with CQRS
Making workflow implementation easy with CQRSMaking workflow implementation easy with CQRS
Making workflow implementation easy with CQRS
 
Adam's At Home Auction April 9th 2017
 Adam's At Home Auction April 9th 2017 Adam's At Home Auction April 9th 2017
Adam's At Home Auction April 9th 2017
 
Petteri Paasio: Miten tästä eteenpäin?
Petteri Paasio: Miten tästä eteenpäin?Petteri Paasio: Miten tästä eteenpäin?
Petteri Paasio: Miten tästä eteenpäin?
 
Técnicas anti violación y secuestro nivel 1
Técnicas anti violación y secuestro nivel 1Técnicas anti violación y secuestro nivel 1
Técnicas anti violación y secuestro nivel 1
 
Protection des données personnelles Se mettre en conformité : Pourquoi ? Comm...
Protection des données personnelles Se mettre en conformité : Pourquoi ? Comm...Protection des données personnelles Se mettre en conformité : Pourquoi ? Comm...
Protection des données personnelles Se mettre en conformité : Pourquoi ? Comm...
 
IRISH & INTERNATIONAL ART AUCTION 10th April 2017
IRISH & INTERNATIONAL ART AUCTION 10th April 2017IRISH & INTERNATIONAL ART AUCTION 10th April 2017
IRISH & INTERNATIONAL ART AUCTION 10th April 2017
 
Thomas Woznicki vs. Dennis Erickson Court Filing (September 1995)
Thomas Woznicki vs. Dennis Erickson Court Filing (September 1995)Thomas Woznicki vs. Dennis Erickson Court Filing (September 1995)
Thomas Woznicki vs. Dennis Erickson Court Filing (September 1995)
 

Similar to A Brave new object store world

S104876 ibm-cos-jburg-v1809b
S104876 ibm-cos-jburg-v1809bS104876 ibm-cos-jburg-v1809b
S104876 ibm-cos-jburg-v1809bTony Pearson
 
S100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804cS100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804cTony Pearson
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AITorsten Steinbach
 
Three ways object storage can save you time in 2017
Three ways object storage can save you time in 2017Three ways object storage can save you time in 2017
Three ways object storage can save you time in 2017Maciej Lasota
 
S106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902aS106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902aTony Pearson
 
Ibm integrated analytics system
Ibm integrated analytics systemIbm integrated analytics system
Ibm integrated analytics systemModusOptimum
 
Simplicity Without Compromise Building a Cognitive Cloud
Simplicity Without Compromise Building a Cognitive CloudSimplicity Without Compromise Building a Cognitive Cloud
Simplicity Without Compromise Building a Cognitive CloudNEXTtour
 
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM CloudIBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM CloudTorsten Steinbach
 
S104873 nas-sizing-jburg-v1809d
S104873 nas-sizing-jburg-v1809dS104873 nas-sizing-jburg-v1809d
S104873 nas-sizing-jburg-v1809dTony Pearson
 
IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015Doug O'Flaherty
 
Design - Building a Foundation for Hybrid Cloud Storage
Design - Building a Foundation for Hybrid Cloud StorageDesign - Building a Foundation for Hybrid Cloud Storage
Design - Building a Foundation for Hybrid Cloud StorageLaurenWendler
 
IBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use casesIBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use casesTony Pearson
 
Hybrid as a Stepping Stone: It’s Not All or Nothing for Your Cloud Transforma...
Hybrid as a Stepping Stone: It’s Not All or Nothing for Your Cloud Transforma...Hybrid as a Stepping Stone: It’s Not All or Nothing for Your Cloud Transforma...
Hybrid as a Stepping Stone: It’s Not All or Nothing for Your Cloud Transforma...Amazon Web Services
 
The Future of Data Warehousing, Data Science and Machine Learning
The Future of Data Warehousing, Data Science and Machine LearningThe Future of Data Warehousing, Data Science and Machine Learning
The Future of Data Warehousing, Data Science and Machine LearningModusOptimum
 
IBM Storage for Hybrid Cloud (4Q 2016)
IBM Storage for Hybrid Cloud (4Q 2016)IBM Storage for Hybrid Cloud (4Q 2016)
IBM Storage for Hybrid Cloud (4Q 2016)Elan Freedberg
 
S108283 svc-storwize-lagos-v1905d
S108283 svc-storwize-lagos-v1905dS108283 svc-storwize-lagos-v1905d
S108283 svc-storwize-lagos-v1905dTony Pearson
 
From raw data to business insights. A modern data lake
From raw data to business insights. A modern data lakeFrom raw data to business insights. A modern data lake
From raw data to business insights. A modern data lakejavier ramirez
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2Raul Chong
 

Similar to A Brave new object store world (20)

S104876 ibm-cos-jburg-v1809b
S104876 ibm-cos-jburg-v1809bS104876 ibm-cos-jburg-v1809b
S104876 ibm-cos-jburg-v1809b
 
S100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804cS100299 ibm-cos-orlando-v1804c
S100299 ibm-cos-orlando-v1804c
 
Cloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AICloud-based Data Lake for Analytics and AI
Cloud-based Data Lake for Analytics and AI
 
Three ways object storage can save you time in 2017
Three ways object storage can save you time in 2017Three ways object storage can save you time in 2017
Three ways object storage can save you time in 2017
 
S106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902aS106195 cos-use cases-istanbul-v1902a
S106195 cos-use cases-istanbul-v1902a
 
Ibm integrated analytics system
Ibm integrated analytics systemIbm integrated analytics system
Ibm integrated analytics system
 
Simplicity Without Compromise Building a Cognitive Cloud
Simplicity Without Compromise Building a Cognitive CloudSimplicity Without Compromise Building a Cognitive Cloud
Simplicity Without Compromise Building a Cognitive Cloud
 
Big Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS CloudBig Data Building Blocks with AWS Cloud
Big Data Building Blocks with AWS Cloud
 
Ibm db2 big sql
Ibm db2 big sqlIbm db2 big sql
Ibm db2 big sql
 
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM CloudIBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
IBM THINK 2019 - A Sharing Economy for Analytics: SQL Query in IBM Cloud
 
S104873 nas-sizing-jburg-v1809d
S104873 nas-sizing-jburg-v1809dS104873 nas-sizing-jburg-v1809d
S104873 nas-sizing-jburg-v1809d
 
IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015IBM Spectrum Scale Overview november 2015
IBM Spectrum Scale Overview november 2015
 
Design - Building a Foundation for Hybrid Cloud Storage
Design - Building a Foundation for Hybrid Cloud StorageDesign - Building a Foundation for Hybrid Cloud Storage
Design - Building a Foundation for Hybrid Cloud Storage
 
IBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use casesIBM Cloud Object Storage: How it works and typical use cases
IBM Cloud Object Storage: How it works and typical use cases
 
Hybrid as a Stepping Stone: It’s Not All or Nothing for Your Cloud Transforma...
Hybrid as a Stepping Stone: It’s Not All or Nothing for Your Cloud Transforma...Hybrid as a Stepping Stone: It’s Not All or Nothing for Your Cloud Transforma...
Hybrid as a Stepping Stone: It’s Not All or Nothing for Your Cloud Transforma...
 
The Future of Data Warehousing, Data Science and Machine Learning
The Future of Data Warehousing, Data Science and Machine LearningThe Future of Data Warehousing, Data Science and Machine Learning
The Future of Data Warehousing, Data Science and Machine Learning
 
IBM Storage for Hybrid Cloud (4Q 2016)
IBM Storage for Hybrid Cloud (4Q 2016)IBM Storage for Hybrid Cloud (4Q 2016)
IBM Storage for Hybrid Cloud (4Q 2016)
 
S108283 svc-storwize-lagos-v1905d
S108283 svc-storwize-lagos-v1905dS108283 svc-storwize-lagos-v1905d
S108283 svc-storwize-lagos-v1905d
 
From raw data to business insights. A modern data lake
From raw data to business insights. A modern data lakeFrom raw data to business insights. A modern data lake
From raw data to business insights. A modern data lake
 
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part20812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
0812 2014 01_toronto-smac meetup_i_os_cloudant_worklight_part2
 

Recently uploaded

1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改yuu sss
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfchwongval
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxdolaknnilon
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 

Recently uploaded (20)

1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Call Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort ServiceCall Girls in Saket 99530🔝 56974 Escort Service
Call Girls in Saket 99530🔝 56974 Escort Service
 
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptxEMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM  TRACKING WITH GOOGLE ANALYTICS.pptx
EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx
 
RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.RABBIT: A CLI tool for identifying bots based on their GitHub events.
RABBIT: A CLI tool for identifying bots based on their GitHub events.
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
IMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptxIMA MSN - Medical Students Network (2).pptx
IMA MSN - Medical Students Network (2).pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 

A Brave new object store world

  • 1. © 2017 IBM Corporation Deep dive into high performance analytics with Apache Spark and Object Storage Effi Ofer effio@il.ibm.com In collaboration with Elliot Kolodner, Gil Vernik, Kalman meth, Maya Anderson, and Michael Factor from IBM Research as well as Francesco Pace and Pietro Michiardi from Eurecom Mar 2017 A brave new object store world
  • 2. © 2017 IBM Corporation 2Page© 2016 IBM Corporation 2Page© 2016 IBM Corporation 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 Data - the natural resource of the 21st century With the rise of cloud, mobility, IoT, social and analytics, data explosion is accelerating. This confluence of technologies has amplified the data explosion, creating incredible growth-on-growth for unstructured data. New data sources are added daily, resulting in a valuable data ecosystem for every business. 500 375 250 125 75 billion Internet-connected devices by 20202 90% of all data was created in the last 2 years1 3 Sources: 1. Science Daily, Big Data, for better or worse: 90% of world’s data generated over last two year, 2013 2. Business Insider, Morgan Stanley: 75 Billion Devices Will Be Connected to The Internet of Things By 2020, 2013 3. Digital Universe of Opportunities: Rich Data & The Increasing Value of the Internet of Things, EMC Digital Universe with Research & Analysis by IDC, April 2014 Projected Exabytes
  • 3. © 2017 IBM Corporation Historic Traffic Sensor Data Extracting insights from data is critical Are current traffic conditions anomalous? What are “normal” traffic conditions at morning rush hour? How has the traffic flow changed over time? How will the flow change if a traffic lane is removed?
  • 4. © 2017 IBM Corporation Agenda – Introduction to storing data: Object Store and IBM COS – Introduction to analyzing data: Apache Spark – Deep dive into Spark and storage
  • 5. © 2017 IBM Corporation Object Storage ▪ High capacity, low cost storage ▪ Examples: – IBM Cloud Object Storage (IBM COS) – Azure Blob Storage – OpenStack Swift – Amazon S3 ▪ Objects stored in flat address space ▪ An object encapsulates data and metadata ▪ Whole object creation; no update in-place ▪ Accessed through RESTful HTTP – Ideal for mobile, natural for cloud
  • 6. © 2017 IBM Corporation 6Page© 2016 IBM Corporation 6Page© 2016 IBM Corporation IBM Cloud Object Storage: designed for cloud-scale data Address unstructured data storage from petabyte to exabyte with reliability, security, availability and disaster recovery Making IBM COS extremely durable, even at large scale – Shared nothing architecture, with strong consistency for data – Scalable namespace mapping with no centralized metadata – Highly reliable and available without replication using erasure coding – Distributed Rebuilder – Distributed collection and storage of statistics needed for Management – APIs for integration with external management applications – Support “lazy” handling of disk failures – Automated network installation
  • 7. © 2017 IBM Corporation 7Page© 2016 IBM Corporation 7Page© 2016 IBM Corporation Our object storage requires only 1.7 TBs raw storage capacity for 1 TB of usable storage. 0.56 TB D.C. 0.56 TB Dallas 0.56 TB San Jose 1.7 TB of raw storage Three complete copies of the object—plus overhead —are distributed and maintained in separate locations in case of failure or disaster. Resulting in 3.6 TB of total storage consumed. With traditional storage, a single 1 TB object will be replicated three times. Traditional Storage 1 TB of usable data IBM Cloud Object Storage Traditional storage requires 3.6 TBs raw storage capacity for 1 TB of usable storage. With IBM Cloud Object storage there’s no need to store replicated data in different systems. A single TB of object storage is encrypted and sliced but never replicated. Slices are distributed geographically for durability and availability. You can lose some number of slices due to failure or disaster, and still quickly recover 100% of your data. IBM Cloud Object Storage requires less than half the storage and 70% lower TCO. Traditional Storage makes Big Data even bigger Multiple copies, more storage, more overhead 1.2 TB D.C. 1.2 TB Dallas 1.2 TB San Jose 3.6 TB of raw storage IBM Cloud Object Storage is built for cloud-scale data Just as reliable, less complex, more cost-efficient than traditional storage 1 TB of usable data
  • 8. © 2017 IBM Corporation 8Page© 2016 IBM Corporation 8Page© 2016 IBM Corporation IBM Cloud Object Storage delivers industry leading flexibility, scalability and simplicity On-Premise • Single tenant • Design specific to needs • Total control of system • Local to on-premise compute Dedicated • No datacenter space required • Single tenant • Flexible configuration options • OPEX vs CAPEX Hybrid Same as on-premise plus the following: • Economic benefits of more dispersed sites (i.e., 3 rather than 2) • On-premise storage replicated to the cloud • Ability to add capacity to an on-premise deployment when there is no more data center space available Public IBM managed options provide full management, monthly billing Regional Cross Regional • Usage-based pricing • Elastic capacity • No datacenter space required • Fully managed • Data local to in-cloud compute • Immediate worldwide footprint • OPEX vs CAPEX
  • 9. © 2017 IBM Corporation 9Page© 2016 IBM Corporation 9Page© 2016 IBM Corporation 0.56 TB D.C. 0.56 TB Dallas 0.56 TB San Jose 1.7 TB of raw storage IBM Cloud Object Storage Slices are distributed geographically for durability and availability. - D.C - Dallas - San Jose IBM Cloud Object Storage is available on Bluemix Traditional Storage makes Big Data even bigger Multiple copies, more storage, more overhead IBM Cloud Object Storage on Bluemix Cross Regional Support 1 TB of usable data
  • 10. © 2017 IBM Corporation Agenda – Introduction to storing data: Object Store and IBM COS –Introduction to analyzing data: Apache Spark – Deep dive into Spark and storage
  • 11. © 2017 IBM Corporation Driving value from data: Data analytics with Apache Spark is an open source in-memory application framework for distributed data processing and iterative analysis on massive data volumes Executor Task Task Executor Task Task Executor Task Task Driver
  • 12. © 2017 IBM Corporation Spark enables analytic over many different data sources Spark Core general compute engine, handles distributed task dispatching, scheduling and basic I/O functions Spark SQL Spark Streaming MLlib (machine learning) GraphX (graph) executes SQL statements performs streaming analytics using micro-batches common machine learning and statistical algorithms distributed graph processing framework large variety of data sources and formats can be supported, both on-premise or cloud BigInsights (HDFS) Cloudant dashDB Object Storage SQL DB …many others IBM CLOUD OTHER CLOUD OTHER ON-PREM IBM ON-PREM
  • 13. © 2017 IBM Corporation Agenda
– Introduction to storing data: Object Store and IBM COS
– Introduction to analyzing data: Apache Spark
– Deep dive into Spark and storage
  • 14. © 2017 IBM Corporation Connecting Spark to Storage
▪ Spark interacts with its storage system through the Hadoop Filesystem interface.
▪ A connector is implemented for each storage system, for example:
– HDFS (Hadoop Distributed File System)
– Object storage such as IBM COS, S3, and OpenStack Swift
▪ Example of reading and writing with HDFS:
data = sc.textFile("hdfs://vault1/inputObjectName")
data.saveAsTextFile("hdfs://vault1/outputObjectName")
▪ Example of reading and writing with IBM Cloud Object Storage:
data = sc.textFile("s3d://vault1.service/inputObjectName")
data.saveAsTextFile("s3d://vault1.service/outputObjectName")
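Connectors are wired in through the Hadoop configuration. A hedged sketch: fs.&lt;scheme&gt;.impl is the standard Hadoop convention, and com.ibm.stocator.fs.ObjectStoreFileSystem is the class named in the Stocator documentation (verify against your version):
// Register the Stocator connector for the s3d:// scheme (class name per the Stocator README; check your version).
sc.hadoopConfiguration.set("fs.s3d.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
val data = sc.textFile("s3d://vault1.service/inputObjectName")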
  • 15. © 2017 IBM Corporation Using HDFS with co-located storage and compute has its pain points...
– Need to scale compute with storage: a poor match to explosive data growth.
– Cloud and on-prem experiences differ: cloud services run on object storage.
...which object storage as the persistent storage layer addresses:
– Scale storage independently; directly use data stored for other reasons.
– Same cloud and on-prem experiences.
[Diagram: a traditional deployment with Spark co-located on HDFS nodes vs. a deployment with Spark reading from IBM Cloud Object Storage.]
  • 16. © 2017 IBM Corporation How Spark writes to HDFS
Simple code that writes a single file to storage:
val data = Array(1)
val distData = sc.parallelize(data)
val finalData = distData.coalesce(1)
finalData.saveAsTextFile("hdfs://vault1/data.txt")
The Spark driver and executor recursively create the directories for the task temporary, job temporary, and final output (steps 1-2). The task writes its task temporary file (step 3). At task commit the executor lists the task temporary directory and renames the file it finds to its job temporary name (steps 4-5). At job commit the driver recursively lists the job temporary directories and renames the files it finds to their final names (steps 6-7). Finally, the driver writes the _SUCCESS object (step 8).
1. Spark driver: make directories recursively: hdfs://vault1/data.txt/_temporary/0
2. Spark executor: make directories recursively: hdfs://vault1/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1
3. Spark executor: write task temporary object: hdfs://vault1/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1/part-00001
4. Spark executor: list directory: hdfs://vault1/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1
5. Spark executor: rename task temporary object to job temporary object: hdfs://vault1/data.txt/_temporary/0/task_201702221313_0000_m_000001/part-00001
6. Spark driver: list job temporary directories recursively: hdfs://vault1/data.txt/_temporary/0/task_201702221313_0000_m_000001
7. Spark driver: rename job temporary object to final name: hdfs://vault1/data.txt/part-00001
8. Spark driver: write _SUCCESS object: hdfs://vault1/data.txt/_SUCCESS
  • 17. © 2017 IBM Corporation How Spark writes to Object Storage
Simple code that writes a single file to storage:
val data = Array(1)
val distData = sc.parallelize(data)
val finalData = distData.coalesce(1)
finalData.saveAsTextFile("s3a://vault1/data.txt")
File operations are translated into RESTful calls: HEAD, GET, PUT, COPY, and DELETE.
1. Spark driver: make directories recursively: s3a://vault1/data.txt/_temporary/0
2. Spark executor: make directories recursively: s3a://vault1/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1
3. Spark executor: write task temporary object: s3a://vault1/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1/part-00001
4. Spark executor: list directory: s3a://vault1/data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000001_1
5. Spark executor: rename task temporary object to job temporary object: s3a://vault1/data.txt/_temporary/0/task_201702221313_0000_m_000001/part-00001
6. Spark driver: list job temporary directories recursively: s3a://vault1/data.txt/_temporary/0/task_201702221313_0000_m_000001
7. Spark driver: rename job temporary object to final name: s3a://vault1/data.txt/part-00001
8. Spark driver: write _SUCCESS object: s3a://vault1/data.txt/_SUCCESS
Note that an object store has no native rename, so each rename becomes a COPY followed by a DELETE, which is expensive for large objects.
  • 18. © 2017 IBM Corporation So what is going on here? What is written to storage?
▪ Output goes to storage through the Hadoop FileOutputCommitter.
▪ Each task execution attempt:
– writes its own task temporary file;
– at task commit, renames the task temporary file to a job temporary file.
– Task commit is done by the executors, so it occurs in parallel.
▪ When all of the tasks of a job complete, the driver:
– calls the output committer to do job commit;
– renames the job temporary files to their final names.
– Job commit occurs in the driver after all of the tasks have committed, so it does not benefit from parallelism.
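A conceptual sketch of this two-phase rename protocol using the Hadoop FileSystem API (simplified; not the actual FileOutputCommitter source):
import org.apache.hadoop.fs.{FileSystem, Path}
// Task commit: each executor renames its own task temporary file, in parallel with the others.
def commitTask(fs: FileSystem, taskTmpFile: Path, jobTmpFile: Path): Unit =
  fs.rename(taskTmpFile, jobTmpFile)
// Job commit: the driver serially renames every job temporary file to its final name,
// then writes the _SUCCESS marker.
def commitJob(fs: FileSystem, jobTmpDir: Path, finalDir: Path): Unit = {
  for (status <- fs.listStatus(jobTmpDir))
    fs.rename(status.getPath, new Path(finalDir, status.getPath.getName))
  fs.create(new Path(finalDir, "_SUCCESS")).close()
}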
  • 19. © 2017 IBM Corporation Why such complexity? It avoids incomplete results being misinterpreted as complete. Can it be simplified? Output committer version 2:
– Task temporary files are renamed to their final names at task commit.
– Job commit is largely reduced to the writing of the _SUCCESS object.
– However, as of Hadoop 2.7.3, this algorithm is not yet the default (https://issues.apache.org/jira/browse/MAPREDUCE-6336).
Can it be improved even further?
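For reference, version 2 can be selected explicitly; a minimal sketch, assuming the property is set before the job writes its output (the property name is the standard Hadoop one):
// Switch the FileOutputCommitter to algorithm version 2: task commit renames straight to final names.
sc.hadoopConfiguration.set("mapreduce.fileoutputcommitter.algorithm.version", "2")
// Equivalent spark-submit form: --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
Even with version 2, each task commit still relies on rename, which an object store implements as a copy plus a delete; this sets the stage for Stocator.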
  • 20. © 2017 IBM Corporation Introducing Stocator
▪ A fast object store connector for Apache Spark that takes advantage of object store semantics; it implements the Hadoop Filesystem interface.
▪ Available at https://github.com/SparkTC/stocator
▪ Marks output as made by Stocator:
– The driver is responsible for creating a 'directory' to hold the output dataset.
– Stocator uses this 'directory' as a marker that it wrote the output.
– This 'directory' is a zero-byte object with the name of the dataset.
▪ Avoids renames:
– When asked to create a temporary object, Stocator recognizes the pattern of the name and writes the object directly under its final name.
– For example, <dataset-name>/_temporary/0/_temporary/attempt_<job-timestamp>_0000_m_000000_<attempt-number>/part-<part-number>
– becomes <dataset-name>/part-<part-number>_attempt_<job-timestamp>_0000_m_000000_<attempt-number>
▪ When all tasks are done, the driver writes the _SUCCESS object.
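A toy illustration of the rename-avoiding rewrite (a simplified regex, not Stocator's actual implementation; the separator follows the concrete example on the next slide):
// Map a task temporary path directly to its final object name, skipping the rename step.
val tmpPattern = """(.*)/_temporary/0/_temporary/(attempt_\d+_\d+_m_\d+_\d+)/(part-\d+)""".r
val rewritten = "data.txt/_temporary/0/_temporary/attempt_201702221313_0000_m_000000_1/part-00000" match {
  case tmpPattern(dataset, attempt, part) => s"$dataset/$part-$attempt"
  case other => other // not a temporary path: leave unchanged
}
println(rewritten) // data.txt/part-00000-attempt_201702221313_0000_m_000000_1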
  • 21. © 2017 IBM Corporation Writing to Object Storage using Stocator
Simple code that writes a single file to storage:
val data = Array(1)
val distData = sc.parallelize(data)
val finalData = distData.coalesce(1)
finalData.saveAsTextFile("s3d://vault1.service/data.txt")
1. Spark driver: create a 'directory': s3d://vault1.service/data.txt
2. Spark driver: get container: s3d://vault1.service/data.txt
3. Spark executor: create output object: s3d://vault1.service/data.txt/part-00000-attempt_201702231115_0001_m_000000_1000
4. Spark driver: write _SUCCESS object: s3d://vault1.service/data.txt/_SUCCESS
  • 22. © 2017 IBM Corporation Wordcount using Stocator. First, provide the credentials for the object store where the data resides (shown as a notebook screenshot in the original slide).
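The screenshot is not reproduced in this transcript; the configuration would look roughly like the sketch below. The property names follow Stocator's fs.&lt;scheme&gt;.&lt;service&gt; pattern but are illustrative; consult the Stocator README for the exact keys in your version:
// Illustrative only: the exact Stocator property names vary by version (see the README on GitHub).
sc.hadoopConfiguration.set("fs.s3d.service.access.key", "MY_ACCESS_KEY") // hypothetical key name
sc.hadoopConfiguration.set("fs.s3d.service.secret.key", "MY_SECRET_KEY") // hypothetical key name
sc.hadoopConfiguration.set("fs.s3d.service.endpoint", "https://s3.example.com") // hypothetical endpoint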
  • 23. © 2017 IBM Corporation Wordcount using Stocator. Next, write the code that accesses and manipulates the data (shown as a notebook screenshot in the original slide).
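Again the screenshot is not reproduced; a standard Spark word count over the same service would look roughly like this (object names are hypothetical):
// Classic word count: split lines into words, count each word, write the result back to COS.
val text = sc.textFile("s3d://vault1.service/book.txt") // hypothetical input object
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("s3d://vault1.service/wordcount-result") // hypothetical output dataset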
  • 24. © 2017 IBM Corporation Wordcount using Stocator. Finally, the output (shown as a notebook screenshot in the original slide).
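The output screenshot is likewise not reproduced, but given the Stocator write path shown on slide 21, the resulting listing in the object store would consist of part objects written directly under their final names plus a zero-byte _SUCCESS marker, along the lines of (attempt IDs are illustrative):
wordcount-result/part-00000-attempt_<job-timestamp>_0000_m_000000_<attempt-number>
wordcount-result/_SUCCESS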
  • 25. © 2017 IBM Corporation Performance evaluations
▪ Spark cluster: Spark 2.0.1; 3 x dual Intel Xeon E5-2690 (12 hyper-threaded cores, 2.60 GHz); 256 GB RAM; 1 x 10 Gbps network; 1 x 1 TB SATA disk.
▪ Object store: IBM Cloud Object Storage cluster.
Workloads (#jobs / #stages / input size / output size):
– Copy: 1 / 1 / 50 GB / 50 GB
– Read: 1 / 1 / 50 GB / --
– Teragen: 1 / 1 / -- / 50 GB
– Wordcount: 1 / 2 / 50 GB / 1.6 MB
– Terasort: 2 / 4 / 50 GB / 50 GB
– TPC-DS: 112 / 179 / 50 GB raw (15 GB Parquet) / --
  • 26. © 2017 IBM Corporation Comparing Stocator to base Hadoop Swift and s3a
[Chart: run time in seconds for Teragen, Copy, Terasort, Wordcount, Read (50 GB), Read (500 GB), and TPC-DS with Stocator, Hadoop Swift, and s3a. Speedup labels, in workload order: 18x, 10x, 9x, 2x, 1x, 1x, 1x*.]
Stocator is much faster for write workloads and about the same for read workloads.
* Comparing Stocator to s3a.
  • 27. © 2017 IBM Corporation Comparing Stocator to s3a with non-default features
[Chart: run time in seconds for the same workloads with Stocator, s3a, s3a CV2 (commit version 2), and s3a CV2+FU (commit version 2 plus fast upload). Speedup labels, in workload order: 1.5x, 1.3x, 1.3x, 1.1x, 1x, 1x, 1x*.]
* Comparing Stocator to s3a with commit version 2 plus fast upload.
  • 28. © 2017 IBM Corporation Comparing REST operations
[Chart: number of RESTful operations for each workload with Stocator, Hadoop Swift, and s3a. Ratio labels, in workload order: 33x, 25x, 24x, 25x, 2x, 2x, 2x*.]
Stocator has a lower impact on the object store than s3a, reducing both cost and overhead.
* Comparing Stocator to s3a.
  • 29. © 2017 IBM Corporation Questions?
  • 30. © 2017 IBM Corporation Backup Slides
  • 31. © 2017 IBM Corporation Reading a dataset written by Stocator
▪ Confirm that the dataset was produced by Stocator, using the metadata on the initial 'directory' object.
▪ Confirm that the _SUCCESS object exists.
▪ List the object parts belonging to the dataset, using a GET container RESTful call.
▪ If there are multiple objects from different execution attempts of the same task, choose the one with the most data. This is correct given:
– the fail-stop assumption (i.e., a Spark server executes correctly until it halts);
– all successful execution attempts write the same output;
– there are no in-place updates in an object store;
– at least one attempt succeeded (evidenced by the _SUCCESS object).
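A sketch of this "most data wins" selection rule (illustrative Scala, not Stocator's source):
// Among duplicate execution attempts of the same part, keep the attempt object with the most data.
case class PartObject(name: String, partId: String, length: Long)
def selectParts(listing: Seq[PartObject]): Seq[PartObject] =
  listing
    .groupBy(_.partId)        // attempts of the same task share a part number
    .values
    .map(_.maxBy(_.length))   // under fail-stop, the longest attempt is the complete one
    .toSeq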
  • 32. © 2017 IBM Corporation Additional optimizations
▪ Streaming of output:
– Normally, the length of an object is a parameter to a PUT operation, which means Spark would need to cache the entire object before issuing the PUT.
– Hadoop Swift and s3a by default cache the object in the local file system.
– Stocator instead leverages HTTP chunked transfer encoding: the object is sent in chunks (64 KB in Stocator), so there is no need to know the final object length before issuing the PUT.
– This is similar to s3a fast upload, which has a minimum size of 5 MB per part.
▪ Avoid a HEAD operation just before a GET:
– Often the HEAD is used only to confirm that the object exists and to determine its size.
– However, a GET also returns the object metadata, so in many cases Stocator can avoid the extra HEAD call before a GET.
▪ Cache the results of HEAD operations, since Spark assumes the input dataset is immutable.
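For the streaming point, a minimal sketch of a chunked PUT using plain java.net APIs (the URL is hypothetical; setChunkedStreamingMode is the standard HttpURLConnection call, here with the slide's 64 KB chunk size):
import java.net.{HttpURLConnection, URL}
// PUT an object without knowing its length up front, via HTTP chunked transfer encoding.
val conn = new URL("https://s3.example.com/vault1/data.txt/part-00000") // hypothetical endpoint and object
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("PUT")
conn.setDoOutput(true)
conn.setChunkedStreamingMode(64 * 1024) // send the body in 64 KB chunks; no Content-Length needed
val out = conn.getOutputStream
out.write("example object contents".getBytes("UTF-8"))
out.close()
println(s"PUT completed with HTTP ${conn.getResponseCode}")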