SlideShare a Scribd company logo
Alluxio 2.0 & Near Real-time
Analytics with Spark & Alluxio
@VipShop
Alluxio Bay Area Meetup
@alluxio alluxio.org/slack info@alluxio.com
Special thanks to AICamp and
ODSC for co-hosting!
Alluxio Bay Area Meetup
@alluxio alluxio.org/slack info@alluxio.com
Alluxio 2.0.0-preview
03/14 Alluxio Meetup
● Release Manager for Alluxio 2.0.0
● Contributor since Tachyon 0.4 (2012)
● Founding Engineer @ Alluxio
About Me
Calvin Jia
Alluxio Overview
• Open source, distributed storage system
• Commonly used for data analytics such as OLAP on Hadoop
• Deployed at Huya, Two Sigma, Tencent, and many others
• Largest deployments of over 1000 nodes
Java File API HDFS Interface S3 Interface REST APIPOSIX Interface
HDFS Driver Swift Driver S3 Driver NFS Driver
Agenda
Alluxio 2.0 Motivation1
Architectural Innovations2
Release Roadmap3
Alluxio 2.0 Motivations
Why 2.0
• Alluxio 1.x target use cases are largely addressed
• Three major types of feedback from users
• Want to support POSIX-based workloads, especially ML
• Want better options for data management
• Want to scale to larger clusters
Use Cases
Alluxio 1.x
• Burst compute into cloud with data
on-prem
• Enable object stores for data
analytics platforms
• Accelerate OLAP on Hadoop
Example
• As a data scientist, I want to be able
to spin up my own elastic compute
cluster that can easily and efficiently
access my data stores
New in Alluxio 2.x
• Enable ML/DL frameworks on object
stores
• Data lifecycle management and data
migration
Examples
• As a data scientist, I want to run my
existing simulations on larger
datasets stored in S3.
• As a data infrastructure engineer, I
want to automatically tier data
between Alluxio and the under store.
ML/DL Workloads
• Alluxio 1.x focuses primarily on Hadoop based workloads, ie. OLAP
on Hadoop
• Alluxio 2.x will continue to excel for these workloads
• New emphasis on ML frameworks such as Tensorflow
• Primarily accesses the same data set which Alluxio already is serving
• Challenges include new API and file characteristics, such as file access
pattern and file sizes
Data Management
• Finer grained control over Alluxio replication
• Automated and scalable async persistence
• Distributed data loading
• Mechanism for cross-mount data operations
Scaling
• Namespace scaling - scale to 1 billion files
• Cluster scaling - scale to 3000 worker nodes
• Client scaling - scale to 30,000 concurrent clients
Architectural Innovations
Architectural Innovations in 2.0
• Off heap metadata storage (namespace scaling)
• gRPC transport layer (cluster and client scaling)
• Improved POSIX API (new workloads)
• Job Service (enable data management)
• Embedded Journal and Internal Leader Election (better integration
with object stores, fewer external dependencies)
Off Heap Metadata Storage
• Uses an embedded RocksDB to store inode tree
• Internal cache for frequently used inodes
• Performance is comparable to previous on-heap option when
working set can fit in cache
gRPC Transport Layer
• Switch from Thrift (metadata) + Netty (data) transport to a
consolidated gRPC based transport
• Connection multiplexing to reduce the number of connections from
# of application threads to # of applications
• Threading model enables the master to serve concurrent requests
without being limited by internal threadpool size or open file
descriptors on the master
Improved POSIX API
• Alluxio FUSE based POSIX API
• Limitations such as no random write, file cannot be read until
complete
• Validated against Tensorflow’s image recognition and
recommendation workloads
• Taking suggestions for other POSIX-based workloads!
Job Service
• New process which serves as a lightweight computation framework
for Alluxio specific tasks
• Enables replication factor control without user input
• Enables faster loading/persisting of data in a distributed manner
• Allows users to do cross-mount operations
• Async through is handled automatically
Embedded Journal and Internal Leader Election
• New journaling service reliant only on Alluxio master processes
• No longer need an external distributed storage to store the journal
• Greatly benefits environments without a distributed file system
• Uses Raft as the consensus algorithm
• Consensus is used for journal integrity
• Consensus can also be used for leader election in high availability mode
Release Roadmap
Alluxio 2.0.0 Release
• Alluxio 2.0.0-preview is available now
• Any and all feedback is appreciated!
• File bugs and feature requests on our Github issues
• Alluxio 2.0.0 will be released in ~3 months
Questions?
Alluxio Website - www.alluxio.org
Alluxio Community Slack Channel - www.alluxio.org/slack
Alluxio Office Hours & Webinars - https://www.alluxio.org/resources/events
Wanchun Wang
Chief Data Architect
Near Real-time ETL
platform using Spark &
Alluxio
About VipShop
• A leading online discount retailer for brands in China
• $12.3 Billion net revenue in 2018
• 20M+ visitors/day
q - 5 -5 5
q 5 2 2 & 2 : 5
q 5 5 : 2 : 5
q 5 5 5 2 5 5 - 5- 22: 5
q : : 5
q 22 2 -5
26
Overview - Big Data systems
q Separate Streaming and Batch platforms, single data pre-
processing pipeline, no longer a pure Lambda architecture
q Typically streaming data get sinked into hive tables every 5
minutes
q More ETL jobs are moving toward Near Real Time
Lo
g
Kafka
Data
Cleansin
g
Kafka
Augmen
-tation
Kafka
Hive
Delta
Hive
Daily
Streaming(Storm/Flink/Spark)
Batch ETL
(Hive/Spark)
27
The process of identifying a set of user actions (“events”) across screens and touch
points that contribute in some manner to a product sale, and then assigning value to
each of these events.
front
today’s
new
man’s
special
Product A
detail
man’s
special
Product B
detail
add cartorder
Near Real-time sales attribution
is a very complex process
• Recompute full day’s data at each iteration:
• ~ 30 minutes, worst case 2-3 hours
• Many data sources involved:
• page view, add cart, order_with_discount, order_cookie_map, sub_order, prepay_order_goods etc
• Several large data sources each contain billions of records and take up 300GB ~
800GB space on Disk
• Sales Path assignment is very CPU intensive computation
• Written by business analysts
• Complex SQL scripts with UDF functions
Business expectation: updated result every 5 - 15 minutes
29
+ + ++
+
Running performance sensitive jobs on current batch platform
not an option
• Around 200K batch jobs executed daily in Hadoop & Spark clusters
• Hdfs 1400+ nodes
• SSD hdfs 50+ nodes
• Spark Clusters( 300+ nodes)
• Cluster usage is above 80% at normal days, resources are even more saturated
during monthly promotion period
• Many issues contribute to the Inconsistent data access time such as NN RPC too
high, slow DataNode response etc
• Scheduling overhead when running M/R jobs
30
1. Adding more compute power
• Too expensive - Not a real option
2. Improve ETL job to process updates incrementally
3. Create a new, relatively isolated environment
• consistent computing resource allocation
• intermediate data caching
• faster read/write
• Recompute the click paths for the active users in current window
• Merge active user paths with previous full path result
• Less data in computation but one more read on history data
2.Improve ETL Job to process
updates incrementally
32
,
33
q A Satellite Spark + Alluxio 1.8.1 cluster with 27 nodes (48 cores,
256G Memory)
q Alluxio colocated with Spark
qVery consistent read/write I/O time over iterations
q Alluxio Mem + HDD
qDisable multiple copies to save space
qLeave enough memory to OS, improve stability
34
A. Remote HDFS cluster: 1-2 times slow than Alluxio, the biggest problem is there are lots of
spikes
B. Use local HDFS, 30%-100% slower than Alluxio ( Mem + HDD)
C. On dedicated SSD cluster
• on par with Alluxio in regular days, but overall read/write latency doubled during busy days
D. On dedicated Alluxio cluster, still not as good as co-located setup ( more test to be done)
E. Spark Cache
• Our daily views, clicks and path result are too big to fit into JVM
• Slow to create and we have lots of “only used twice” data
• Multiple downstream spark apps need to share the data
35
L
q Move the downstream processes closer to the data, avoid duplicating large amount of
data from Alluxio to remote HDFS
q Manage NRT jobs
q A single big Spark Streaming job? too many inputs and outputs at different stages
q Split into multiple jobs? how to coordinate multiple stream jobs
q NRT executed in much higher frequency, very sensitive to system hiccups
q Current batch job scheduling
q Process dependency, executed for every fixed interval
q When there is a severe delay, multiple batch instances for different slot running at
the same time
36
q Report data readiness to Watermark Service, manage dependency
between loosely coupled jobs
q Ultimate goal is get the latest result fast
q a delayed batch might consume the unprocessed input blocks span
over multiple cycles.
q Output for fix intervals is not guaranteed
q not all inputs are mandatory, iteration get kicked off even when
optional input sources are not update for that particular cycle
37
• Easy to setup
• Pluggable, just a simple switch from hdfs://xxxx to alluxio://xxxx
• Together with Spark, either form a separated satellite cluster or on label machines in
our big clusters
• Within our Data Centers, it is easier to allocate computing resources but SSD
machines are scare
• Spark and Alluxio on K8S: Over 1k machines, we need shuffle those machines to
run Streaming, Spark ETL,Presto Ad Hoc Query or ML at different days or different
time of a day
• Very stable in production
• Over 2 and a half years without any major issue. A big thank to Alluxio Engineers!
38
• Async persistent to remote HDFS
• Avoid duplicated write in user code/SQL,
• Put hadoop /tmp/ directory on Alluxio over SSD, reduce
NN rpc and load on DN
• Cache hot/warm data for Presto, Heavy traffic and ad hoc
query is very sensitive to HDFS stability
Questions?
Alluxio Bay Area Meetup
@alluxio alluxio.org/slack info@alluxio.com
WeChat

More Related Content

What's hot

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Alluxio, Inc.
 
Improving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokImproving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTok
Alluxio, Inc.
 
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Alluxio, Inc.
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
Alluxio, Inc.
 
StorageQuery: federated querying on object stores, powered by Alluxio and Presto
StorageQuery: federated querying on object stores, powered by Alluxio and PrestoStorageQuery: federated querying on object stores, powered by Alluxio and Presto
StorageQuery: federated querying on object stores, powered by Alluxio and Presto
Alluxio, Inc.
 
Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3
Alluxio, Inc.
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
Alluxio, Inc.
 
Burst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesBurst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copies
Alluxio, Inc.
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data Platforms
Alluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio, Inc.
 
Alluxio Use Cases and Future Directions
Alluxio Use Cases and Future DirectionsAlluxio Use Cases and Future Directions
Alluxio Use Cases and Future Directions
Alluxio, Inc.
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
Alluxio, Inc.
 
The Practice of Alluxio in JD.com
The Practice of Alluxio in JD.comThe Practice of Alluxio in JD.com
The Practice of Alluxio in JD.com
Alluxio, Inc.
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Alluxio, Inc.
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
Alluxio, Inc.
 
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Alluxio, Inc.
 
Presto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On LabPresto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On Lab
Alluxio, Inc.
 
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioSecurely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Alluxio, Inc.
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
Alluxio, Inc.
 
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Alluxio, Inc.
 

What's hot (20)

Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Improving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTokImproving Presto performance with Alluxio at TikTok
Improving Presto performance with Alluxio at TikTok
 
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-sensitive queries for Presto at Facebook: A Collaboration ...
 
Achieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud WorldAchieving Separation of Compute and Storage in a Cloud World
Achieving Separation of Compute and Storage in a Cloud World
 
StorageQuery: federated querying on object stores, powered by Alluxio and Presto
StorageQuery: federated querying on object stores, powered by Alluxio and PrestoStorageQuery: federated querying on object stores, powered by Alluxio and Presto
StorageQuery: federated querying on object stores, powered by Alluxio and Presto
 
Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3Accelerating Hive with Alluxio on S3
Accelerating Hive with Alluxio on S3
 
From limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiencyFrom limited Hadoop compute capacity to increased data scientist efficiency
From limited Hadoop compute capacity to increased data scientist efficiency
 
Burst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copiesBurst Presto & Spark workloads to AWS EMR with no data copies
Burst Presto & Spark workloads to AWS EMR with no data copies
 
How to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data PlatformsHow to Develop and Operate Cloud First Data Platforms
How to Develop and Operate Cloud First Data Platforms
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
Alluxio Use Cases and Future Directions
Alluxio Use Cases and Future DirectionsAlluxio Use Cases and Future Directions
Alluxio Use Cases and Future Directions
 
Best Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with SparkBest Practices for Using Alluxio with Spark
Best Practices for Using Alluxio with Spark
 
The Practice of Alluxio in JD.com
The Practice of Alluxio in JD.comThe Practice of Alluxio in JD.com
The Practice of Alluxio in JD.com
 
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
Optimizing Latency-Sensitive Queries for Presto at Facebook: A Collaboration ...
 
RaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cacheRaptorX: Building a 10X Faster Presto with hierarchical cache
RaptorX: Building a 10X Faster Presto with hierarchical cache
 
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
Enterprise Distributed Query Service powered by Presto & Alluxio across cloud...
 
Presto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On LabPresto on Alluxio Hands-On Lab
Presto on Alluxio Hands-On Lab
 
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with AlluxioSecurely Enhancing Data Access in Hybrid Cloud with Alluxio
Securely Enhancing Data Access in Hybrid Cloud with Alluxio
 
Accelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud EraAccelerate Analytics and ML in the Hybrid Cloud Era
Accelerate Analytics and ML in the Hybrid Cloud Era
 
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
 

Similar to Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio

Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Alluxio, Inc.
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Alluxio, Inc.
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
Peter Clapham
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
Peter Clapham
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
Peter Clapham
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Alluxio, Inc.
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Alluxio, Inc.
 
Nuxeo Platform LTS 2015 Highlights
Nuxeo Platform LTS 2015 HighlightsNuxeo Platform LTS 2015 Highlights
Nuxeo Platform LTS 2015 Highlights
Nuxeo
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute Cluster
Ramsay Key
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
 
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...Splunk
 
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
Dirk Petersen
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
sabnees
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
Alluxio, Inc.
 
Webinar: Faster Log Indexing with Fusion
Webinar: Faster Log Indexing with FusionWebinar: Faster Log Indexing with Fusion
Webinar: Faster Log Indexing with FusionLucidworks
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any Cloud
Alluxio, Inc.
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
DataWorks Summit
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
OpenEBS
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 

Similar to Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio (20)

Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
Introduction to Alluxio 2.0 Preview | Simplifying data access for cloud workl...
 
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
Introduction to Alluxio (formerly Tachyon) and how it brings up to 300x perfo...
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Flexible compute
Flexible computeFlexible compute
Flexible compute
 
Sanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticiansSanger, upcoming Openstack for Bio-informaticians
Sanger, upcoming Openstack for Bio-informaticians
 
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stackAccelerating analytics in the cloud with the Starburst Presto + Alluxio stack
Accelerating analytics in the cloud with the Starburst Presto + Alluxio stack
 
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the CloudInteractive Analytics with the Starburst Presto + Alluxio stack for the Cloud
Interactive Analytics with the Starburst Presto + Alluxio stack for the Cloud
 
Nuxeo Platform LTS 2015 Highlights
Nuxeo Platform LTS 2015 HighlightsNuxeo Platform LTS 2015 Highlights
Nuxeo Platform LTS 2015 Highlights
 
How to Build a Compute Cluster
How to Build a Compute ClusterHow to Build a Compute Cluster
How to Build a Compute Cluster
 
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast DataDatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
 
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
SplunkLive! Nutanix Session - Turnkey and scalable infrastructure for Splunk ...
 
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data B...
 
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and InfrastrctureRevolutionary Storage for Modern Databases, Applications and Infrastrcture
Revolutionary Storage for Modern Databases, Applications and Infrastrcture
 
How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...How the Development Bank of Singapore solves on-prem compute capacity challen...
How the Development Bank of Singapore solves on-prem compute capacity challen...
 
Webinar: Faster Log Indexing with Fusion
Webinar: Faster Log Indexing with FusionWebinar: Faster Log Indexing with Fusion
Webinar: Faster Log Indexing with Fusion
 
Unified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any CloudUnified Big Data Analytics: Any Stack, Any Cloud
Unified Big Data Analytics: Any Stack, Any Cloud
 
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data AnalyticsApache Ignite vs Alluxio: Memory Speed Big Data Analytics
Apache Ignite vs Alluxio: Memory Speed Big Data Analytics
 
Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications Latest (storage IO) patterns for cloud-native applications
Latest (storage IO) patterns for cloud-native applications
 
Oracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_databaseOracle OpenWo2014 review part 03 three_paa_s_database
Oracle OpenWo2014 review part 03 three_paa_s_database
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 

More from Alluxio, Inc.

AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
Alluxio, Inc.
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
Alluxio, Inc.
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
Alluxio, Inc.
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
Alluxio, Inc.
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio, Inc.
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
Alluxio, Inc.
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
Alluxio, Inc.
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
Alluxio, Inc.
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Alluxio, Inc.
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio, Inc.
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio, Inc.
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Alluxio, Inc.
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Alluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Alluxio, Inc.
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
Alluxio, Inc.
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
Alluxio, Inc.
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio, Inc.
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
Alluxio, Inc.
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
Alluxio, Inc.
 

More from Alluxio, Inc. (20)

AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 

Recently uploaded

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
CatarinaPereira64715
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 

Recently uploaded (20)

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
ODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User GroupODC, Data Fabric and Architecture User Group
ODC, Data Fabric and Architecture User Group
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 

Alluxio 2.0 & Near Real-time Big Data Platform w/ Spark & Alluxio

  • 1. Alluxio 2.0 & Near Real-time Analytics with Spark & Alluxio @VipShop Alluxio Bay Area Meetup @alluxio alluxio.org/slack info@alluxio.com
  • 2. Special thanks to AICamp and ODSC for co-hosting! Alluxio Bay Area Meetup @alluxio alluxio.org/slack info@alluxio.com
  • 4. ● Release Manager for Alluxio 2.0.0 ● Contributor since Tachyon 0.4 (2012) ● Founding Engineer @ Alluxio About Me Calvin Jia
  • 5. Alluxio Overview • Open source, distributed storage system • Commonly used for data analytics such as OLAP on Hadoop • Deployed at Huya, Two Sigma, Tencent, and many others • Largest deployments of over 1000 nodes Java File API HDFS Interface S3 Interface REST APIPOSIX Interface HDFS Driver Swift Driver S3 Driver NFS Driver
  • 6. Agenda Alluxio 2.0 Motivation1 Architectural Innovations2 Release Roadmap3
  • 8. Why 2.0 • Alluxio 1.x target use cases are largely addressed • Three major types of feedback from users • Want to support POSIX-based workloads, especially ML • Want better options for data management • Want to scale to larger clusters
  • 9. Use Cases Alluxio 1.x • Burst compute into cloud with data on-prem • Enable object stores for data analytics platforms • Accelerate OLAP on Hadoop Example • As a data scientist, I want to be able to spin up my own elastic compute cluster that can easily and efficiently access my data stores New in Alluxio 2.x • Enable ML/DL frameworks on object stores • Data lifecycle management and data migration Examples • As a data scientist, I want to run my existing simulations on larger datasets stored in S3. • As a data infrastructure engineer, I want to automatically tier data between Alluxio and the under store.
  • 10. ML/DL Workloads • Alluxio 1.x focuses primarily on Hadoop based workloads, ie. OLAP on Hadoop • Alluxio 2.x will continue to excel for these workloads • New emphasis on ML frameworks such as Tensorflow • Primarily accesses the same data set which Alluxio already is serving • Challenges include new API and file characteristics, such as file access pattern and file sizes
  • 11. Data Management • Finer grained control over Alluxio replication • Automated and scalable async persistence • Distributed data loading • Mechanism for cross-mount data operations
  • 12. Scaling • Namespace scaling - scale to 1 billion files • Cluster scaling - scale to 3000 worker nodes • Client scaling - scale to 30,000 concurrent clients
  • 14. Architectural Innovations in 2.0 • Off heap metadata storage (namespace scaling) • gRPC transport layer (cluster and client scaling) • Improved POSIX API (new workloads) • Job Service (enable data management) • Embedded Journal and Internal Leader Election (better integration with object stores, fewer external dependencies)
  • 15. Off Heap Metadata Storage • Uses an embedded RocksDB to store inode tree • Internal cache for frequently used inodes • Performance is comparable to previous on-heap option when working set can fit in cache
  • 16. gRPC Transport Layer • Switch from Thrift (metadata) + Netty (data) transport to a consolidated gRPC based transport • Connection multiplexing to reduce the number of connections from # of application threads to # of applications • Threading model enables the master to serve concurrent requests without being limited by internal threadpool size or open file descriptors on the master
  • 17. Improved POSIX API • Alluxio FUSE based POSIX API • Limitations such as no random write, file cannot be read until complete • Validated against Tensorflow’s image recognition and recommendation workloads • Taking suggestions for other POSIX-based workloads!
  • 18. Job Service • New process which serves as a lightweight computation framework for Alluxio specific tasks • Enables replication factor control without user input • Enables faster loading/persisting of data in a distributed manner • Allows users to do cross-mount operations • Async through is handled automatically
  • 19. Embedded Journal and Internal Leader Election • New journaling service reliant only on Alluxio master processes • No longer need an external distributed storage to store the journal • Greatly benefits environments without a distributed file system • Uses Raft as the consensus algorithm • Consensus is used for journal integrity • Consensus can also be used for leader election in high availability mode
  • 21. Alluxio 2.0.0 Release • Alluxio 2.0.0-preview is available now • Any and all feedback is appreciated! • File bugs and feature requests on our Github issues • Alluxio 2.0.0 will be released in ~3 months
  • 22. Questions? Alluxio Website - www.alluxio.org Alluxio Community Slack Channel - www.alluxio.org/slack Alluxio Office Hours & Webinars - https://www.alluxio.org/resources/events
  • 23. Wanchun Wang Chief Data Architect Near Real-time ETL platform using Spark & Alluxio
  • 24. About VipShop • A leading online discount retailer for brands in China • $12.3 Billion net revenue in 2018 • 20M+ visitors/day
  • 25. q - 5 -5 5 q 5 2 2 & 2 : 5 q 5 5 : 2 : 5 q 5 5 5 2 5 5 - 5- 22: 5 q : : 5 q 22 2 -5
  • 26. 26 Overview - Big Data systems q Separate Streaming and Batch platforms, single data pre- processing pipeline, no longer a pure Lambda architecture q Typically streaming data get sinked into hive tables every 5 minutes q More ETL jobs are moving toward Near Real Time Lo g Kafka Data Cleansin g Kafka Augmen -tation Kafka Hive Delta Hive Daily Streaming(Storm/Flink/Spark) Batch ETL (Hive/Spark)
  • 27. 27 The process of identifying a set of user actions (“events”) across screens and touch points that contribute in some manner to a product sale, and then assigning value to each of these events. front today’s new man’s special Product A detail man’s special Product B detail add cartorder
  • 28. Near Real-time sales attribution is a very complex process • Recompute full day’s data at each iteration: • ~ 30 minutes, worst case 2-3 hours • Many data sources involved: • page view, add cart, order_with_discount, order_cookie_map, sub_order, prepay_order_goods etc • Several large data sources each contain billions of records and take up 300GB ~ 800GB space on Disk • Sales Path assignment is very CPU intensive computation • Written by business analysts • Complex SQL scripts with UDF functions Business expectation: updated result every 5 - 15 minutes
  • 29. 29 + + ++ + Running performance sensitive jobs on current batch platform not an option • Around 200K batch jobs executed daily in Hadoop & Spark clusters • Hdfs 1400+ nodes • SSD hdfs 50+ nodes • Spark Clusters( 300+ nodes) • Cluster usage is above 80% at normal days, resources are even more saturated during monthly promotion period • Many issues contribute to the Inconsistent data access time such as NN RPC too high, slow DataNode response etc • Scheduling overhead when running M/R jobs
  • 30. 30 1. Adding more compute power • Too expensive - Not a real option 2. Improve ETL job to process updates incrementally 3. Create a new, relatively isolated environment • consistent computing resource allocation • intermediate data caching • faster read/write
  • 31. • Recompute the click paths for the active users in current window • Merge active user paths with previous full path result • Less data in computation but one more read on history data 2.Improve ETL Job to process updates incrementally
  • 32. 32 ,
  • 33. 33 q A Satellite Spark + Alluxio 1.8.1 cluster with 27 nodes (48 cores, 256G Memory) q Alluxio colocated with Spark qVery consistent read/write I/O time over iterations q Alluxio Mem + HDD qDisable multiple copies to save space qLeave enough memory to OS, improve stability
  • 34. 34 A. Remote HDFS cluster: 1-2 times slow than Alluxio, the biggest problem is there are lots of spikes B. Use local HDFS, 30%-100% slower than Alluxio ( Mem + HDD) C. On dedicated SSD cluster • on par with Alluxio in regular days, but overall read/write latency doubled during busy days D. On dedicated Alluxio cluster, still not as good as co-located setup ( more test to be done) E. Spark Cache • Our daily views, clicks and path result are too big to fit into JVM • Slow to create and we have lots of “only used twice” data • Multiple downstream spark apps need to share the data
  • 35. 35 L q Move the downstream processes closer to the data, avoid duplicating large amount of data from Alluxio to remote HDFS q Manage NRT jobs q A single big Spark Streaming job? too many inputs and outputs at different stages q Split into multiple jobs? how to coordinate multiple stream jobs q NRT executed in much higher frequency, very sensitive to system hiccups q Current batch job scheduling q Process dependency, executed for every fixed interval q When there is a severe delay, multiple batch instances for different slot running at the same time
  • 36. 36 q Report data readiness to Watermark Service, manage dependency between loosely coupled jobs q Ultimate goal is get the latest result fast q a delayed batch might consume the unprocessed input blocks span over multiple cycles. q Output for fix intervals is not guaranteed q not all inputs are mandatory, iteration get kicked off even when optional input sources are not update for that particular cycle
  • 37. 37 • Easy to setup • Pluggable, just a simple switch from hdfs://xxxx to alluxio://xxxx • Together with Spark, either form a separated satellite cluster or on label machines in our big clusters • Within our Data Centers, it is easier to allocate computing resources but SSD machines are scare • Spark and Alluxio on K8S: Over 1k machines, we need shuffle those machines to run Streaming, Spark ETL,Presto Ad Hoc Query or ML at different days or different time of a day • Very stable in production • Over 2 and a half years without any major issue. A big thank to Alluxio Engineers!
  • 38. 38 • Async persistent to remote HDFS • Avoid duplicated write in user code/SQL, • Put hadoop /tmp/ directory on Alluxio over SSD, reduce NN rpc and load on DN • Cache hot/warm data for Presto, Heavy traffic and ad hoc query is very sensitive to HDFS stability
  • 39. Questions? Alluxio Bay Area Meetup @alluxio alluxio.org/slack info@alluxio.com WeChat