SlideShare a Scribd company logo
Gianmario Spacagna
14th September, 2019 - Alluxio Meetup @ San Francisco, CA
Takeaways
¨  What a logical data warehouse is
¨  How to handle governance issues
¨  An Agile workflow made of iterative exploratory
analysis and production-quality development
¨  A fully in-memory stack for fast computation on top
of Spark and Alluxio
¨  How to successfully do data science if your data
resides in a RDBMS and you don’t have a data
lake
About me
¨  Engineering background in Distributed Systems
¤  (University of Cassino, Polytechnic of Turin, KTH of Stockholm)
¨  Data-relevant experience
¤  Predictive Marketing (AgilOne, StreamSend)
¤  Cyber Security (Cisco)
¤  Financial Services (Barclays)
¤  Automotive (Pirelli) ç
Areas of interest
¨  Functional Programming, Scala and Apache
Spark
¨  Contributor of the
Professional Data Science Manifesto
¨  Founder of Data Science Milan Meetup
community (datasciencemilan.org)
¨  Co-authoring Python Deep Learning book,
coming soon…
Building production-ready and scalable machine
learning systems
(continue with list of principles...)
Data Science Agile cycle
Get
access to
data
Explore
TransformTrain
Evaluate
Analyze
results
Even dozens of
iterations per
day!!!
Successful development
of new data products
requires proper
infrastructure and tools
Start by building a toy model with a small
snapshot of data that can fit in your laptop
memory and eventually ask your organization
for cluster resources
¨  You can’t solve problems with data science if
data is not largely available
¨  Data processing should be fast and reactive to
allow quick iterations
¨  The core team cannot depend on IT folks
Start by building a toy model with a small
snapshot of data that can fit in your laptop
memory and eventually ask your organization
for cluster resources
Data Lake in a legacy enterprise
environment
Technical issues
¨  Engineering effort
¤  dedicated infrastructure team (expensive)
¨  Synchronization with new data from source
¤  Report what portion of data has been exported and what
not
¨  Consistency / Data Versioning / Duplication
¤  ETL logic and requirements change very often
¤  Memory is cheap but when you have hundreds of sparse
copies of same data is confusing
¨  I/O cost
¤  Reading/writing is expensive for iterative and explorative
jobs (machine learning)
Logical Data Warehouse
¨  View and access cleaned versions of data
¨  Always show latest version by default
¨  Apply transformations on-the-fly
(discovery-oriented analytics)
¨  Abstract data representation from rigid structures
of the DB’s persistence store
¨  Simply add new data sources using virtualization
¨  Flexible, fast time-to-market, lower costs
What about governance issues?
¨  Large corporations can’t move data before an approved
governance plan
¨  Data can only be stored in a safe environment
administered by only a few authorized people who don’t
necessary understand data scientists needs
¨  Data leakage paranoia, cloud-phobia!
¨  As result, data cannot be easily/quickly pulled from the
central data warehouse and stored into an external
infrastructure
Long time and large investment for
setting up a new project
That’s not Agile!
Wait a moment, analysts don’t seem to
have this problem…
From disk to volatile memory
Distribute and make data temporary available in-
memory in an ad-hoc development cluster
¨  In-memory engine for distributed data processing
¨  JDBC drivers to connect to relational databases
¨  Structured data represented using DataFrame API
¨  Fully-functional data manipulation via RDD API
¨  Machine learning libraries (ML/MLllib)
¨  Interaction and visualization through
Spark Notebook or Zeppelin
In-memory workflow
Just Spark cache is not enough
¨  Data is dropped from memory
at each context restart due to
¤  Update dependency jar
(common for mixed IDE
development / notebook analysis)
¤  Re-submit the job execution
¤  Kerberos ticket expires L
¨  Fetching 600M rows can take
~ 1 hour in a 5 nodes cluster
Dozens iterations per day => spending most of the time
waiting for data to reload at each iteration!
Distribute and make data temporary persistently
available in-memory in the development cluster and
shared among multiple concurrent applications
From volatile memory to persistent
memory storage
¨  Formerly known as Tachyon
¨  In-memory distributed storage system
¨  Long-term caching of raw data and intermediate
results
¨  Spark can read/write in Alluxio seamlessly instead
of using HDFS
¨  1-tier configuration safely leaves no traces to disk
¨  Data is loaded once and available for the whole
development period to multiple applications
Alluxio as the Key Enabling Technology
1-tier configuration
¨  ALLUXIO_RAM_FOLDER=/dev/shm/ramdisk
¨  alluxio.worker.memory.size=24GB
¨  alluxio.worker.tieredstore
¤  levels=1
¤  level0.alias=MEM
¤  level0.dirs.path=${ALLUXIO_RAM_FOLDER}
¤  level0.dirs.quota=24G
¨  We leave empty the under FS configuration
¨  Deploy without mount (no root access required)
¤  ./bin/alluxio-start.sh all NoMount
Spark read/write APIs
¨  DataFrame
¤  dataframe.write.save(”alluxio://master_ip:port/mydata/
mydataframe.parquet")
¤  val dataframe: DataFrame = sqlContext.read.load(”alluxio://
master_ip:port/mydata/mydataframe.parquet")
¨  RDD
¤  rdd.saveAsObjectFile(”alluxio://master_ip:port/mydata/myrdd.object")
¤  val rdd: RDD[MyCaseClass] = sc.objectFile[MyCaseClass] (”alluxio://
master_ip:port/mydata/myrdd.object")
Making the impossible possible
¨  Agile workflow combining Spark, Scala, DataFrame,
JDBC, Parquet, Kryo and Alluxio to create a
scalable, in-memory, reactive stack to explore data
directly from source and develop production-quality
machine learning pipelines
¨  Data available since day 1 and at every iteration
¤  Alluxio decreased loading time from hours to seconds
¨  Avoid complicated and time-consuming Data
Plumbing operations
Further developments
1.  Memory size limitation
¤  Add external in-memory tiers?
2.  Set-up overhead
¤  JDBC drivers, partitioning strategy and data frame from/to case
class conversion (Spark 2 aims to solve this)
3.  Shared memory resources between Spark and Alluxio
¤  Set Alluxio as OFF_HEAP memory as well and divide memory in
storage and cache
4.  In-Memory replication for read availability
¤  If an Alluxio node fails, data is lost due the absence of an
underlying file system
5.  Would be nice if Alluxio could handle this and mount a
relational table/view in the form of data files
(csv, parquet…)
Follow-up links
¨  Original article on DZone:
¤  dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-
from-Hours-to-Seconds-With-Tachyon
¨  Professional Data Science Manifesto:
¤  datasciencemanifesto.org
¨  Vademecum of Practical Data Science:
¤  datasciencevademecum.wordpress.com
¨  Sparkz
¤  github.com/gm-spacagna/sparkz

More Related Content

What's hot

Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016
Jiří Šimša
 
Hybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxioHybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxio
Thai Bui
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
Guang Xu
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at Scale
Sascha Dittmann
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
Gianmario Spacagna
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatterns
Anurag S
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
ScyllaDB
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
Andraz Tori
 
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
DataStax
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
datastack
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
DataStax
 
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
DataStax
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio, Inc.
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio, Inc.
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with Alluxio
Alluxio, Inc.
 
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio, Inc.
 
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Tachyon Nexus, Inc.
 
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Cloudera, Inc.
 
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDBBenchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Athiq Ahamed
 

What's hot (20)

Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016Alluxio Presentation at Strata San Jose 2016
Alluxio Presentation at Strata San Jose 2016
 
Hybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxioHybrid collaborative tiered storage with alluxio
Hybrid collaborative tiered storage with alluxio
 
GCP Data Engineer cheatsheet
GCP Data Engineer cheatsheetGCP Data Engineer cheatsheet
GCP Data Engineer cheatsheet
 
Microsoft R - Data Science at Scale
Microsoft R - Data Science at ScaleMicrosoft R - Data Science at Scale
Microsoft R - Data Science at Scale
 
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
In-Memory Logical Data Warehouse for accelerating Machine Learning Pipelines ...
 
Bigdata antipatterns
Bigdata antipatternsBigdata antipatterns
Bigdata antipatterns
 
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
MongoDB vs Scylla: Production Experience from Both Dev & Ops Standpoint at Nu...
 
SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!SQL or NoSQL, that is the question!
SQL or NoSQL, that is the question!
 
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
Develop Scalable Applications with DataStax Drivers (Alex Popescu, Bulat Shak...
 
سکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابرسکوهای ابری و مدل های برنامه نویسی در ابر
سکوهای ابری و مدل های برنامه نویسی در ابر
 
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
Building a Pluggable Analytics Stack with Cassandra (Jim Peregord, Element Co...
 
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
Building Data Pipelines with SMACK: Designing Storage Strategies for Scale an...
 
Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016Alluxio Keynote at Strata+Hadoop World Beijing 2016
Alluxio Keynote at Strata+Hadoop World Beijing 2016
 
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the CloudAlluxio+Presto: An Architecture for Fast SQL in the Cloud
Alluxio+Presto: An Architecture for Fast SQL in the Cloud
 
Presto
PrestoPresto
Presto
 
Enabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with AlluxioEnabling Presto Caching at Uber with Alluxio
Enabling Presto Caching at Uber with Alluxio
 
Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18Alluxio: Unify Data at Memory Speed; 2016-11-18
Alluxio: Unify Data at Memory Speed; 2016-11-18
 
Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015Presentation by TachyonNexus & Intel at Strata Singapore 2015
Presentation by TachyonNexus & Intel at Strata Singapore 2015
 
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
 
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDBBenchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
Benchmarking Top NoSQL Databases: Apache Cassandra, Apache HBase and MongoDB
 

Similar to Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016

Logical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetupLogical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetupGianmario Spacagna
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
Ike Ellis
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
Databricks
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Anant Corporation
 
Data Streaming Technology Overview
Data Streaming Technology OverviewData Streaming Technology Overview
Data Streaming Technology Overview
Dan Lynn
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
Data Con LA
 
Scaling SQL and NoSQL Databases in the Cloud
Scaling SQL and NoSQL Databases in the Cloud Scaling SQL and NoSQL Databases in the Cloud
Scaling SQL and NoSQL Databases in the Cloud
RightScale
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
Djamel Zouaoui
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
Edureka!
 
Predictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-timePredictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-time
Aerospike, Inc.
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
Databricks
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data grid
Bogdan Dina
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Databricks
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadius
BoldRadius Solutions
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
Infinity Tech Solutions
 
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniSpark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Demi Ben-Ari
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
Demi Ben-Ari
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
Spark Summit
 

Similar to Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016 (20)

Logical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetupLogical-DataWarehouse-Alluxio-meetup
Logical-DataWarehouse-Alluxio-meetup
 
Azure Databricks is Easier Than You Think
Azure Databricks is Easier Than You ThinkAzure Databricks is Easier Than You Think
Azure Databricks is Easier Than You Think
 
A look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutionsA look under the hood at Apache Spark's API and engine evolutions
A look under the hood at Apache Spark's API and engine evolutions
 
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
Data Engineer's Lunch #82: Automating Apache Cassandra Operations with Apache...
 
Data Streaming Technology Overview
Data Streaming Technology OverviewData Streaming Technology Overview
Data Streaming Technology Overview
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
 
Scaling SQL and NoSQL Databases in the Cloud
Scaling SQL and NoSQL Databases in the Cloud Scaling SQL and NoSQL Databases in the Cloud
Scaling SQL and NoSQL Databases in the Cloud
 
Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming Paris Data Geek - Spark Streaming
Paris Data Geek - Spark Streaming
 
Spark For Faster Batch Processing
Spark For Faster Batch ProcessingSpark For Faster Batch Processing
Spark For Faster Batch Processing
 
Predictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-timePredictable Big Data Performance in Real-time
Predictable Big Data Performance in Real-time
 
Spark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production usersSpark Summit EU 2015: Lessons from 300+ production users
Spark Summit EU 2015: Lessons from 300+ production users
 
Data has a better idea the in-memory data grid
Data has a better idea   the in-memory data gridData has a better idea   the in-memory data grid
Data has a better idea the in-memory data grid
 
VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right VMworld 2013: Virtualizing Databases: Doing IT Right
VMworld 2013: Virtualizing Databases: Doing IT Right
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
Scala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadiusScala Days Highlights | BoldRadius
Scala Days Highlights | BoldRadius
 
Bds session 13 14
Bds session 13 14Bds session 13 14
Bds session 13 14
 
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek AlumniSpark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
Spark 101 – First Steps To Distributed Computing - Demi Ben-Ari @ Ofek Alumni
 
Bring the Spark To Your Eyes
Bring the Spark To Your EyesBring the Spark To Your Eyes
Bring the Spark To Your Eyes
 
Introduction to Spark Training
Introduction to Spark TrainingIntroduction to Spark Training
Introduction to Spark Training
 

More from Alluxio, Inc.

AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
Alluxio, Inc.
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
Alluxio, Inc.
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
Alluxio, Inc.
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
Alluxio, Inc.
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio, Inc.
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio, Inc.
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
Alluxio, Inc.
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
Alluxio, Inc.
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
Alluxio, Inc.
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Alluxio, Inc.
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio, Inc.
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio, Inc.
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Alluxio, Inc.
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Alluxio, Inc.
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Alluxio, Inc.
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
Alluxio, Inc.
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
Alluxio, Inc.
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio, Inc.
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
Alluxio, Inc.
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
Alluxio, Inc.
 

More from Alluxio, Inc. (20)

AI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in MichelangeloAI/ML Infra Meetup | ML explainability in Michelangelo
AI/ML Infra Meetup | ML explainability in Michelangelo
 
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAGAI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
AI/ML Infra Meetup | Reducing Prefill for LLM Serving in RAG
 
AI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning FrameworkAI/ML Infra Meetup | Perspective on Deep Learning Framework
AI/ML Infra Meetup | Perspective on Deep Learning Framework
 
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
AI/ML Infra Meetup | Improve Speed and GPU Utilization for Model Training & S...
 
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-CloudAlluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
Alluxio Monthly Webinar | Simplify Data Access for AI in Multi-Cloud
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Optimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with AlluxioOptimizing Data Access for Analytics And AI with Alluxio
Optimizing Data Access for Analytics And AI with Alluxio
 
Speed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio CachingSpeed Up Presto at Uber with Alluxio Caching
Speed Up Presto at Uber with Alluxio Caching
 
Correctly Loading Incremental Data at Scale
Correctly Loading Incremental Data at ScaleCorrectly Loading Incremental Data at Scale
Correctly Loading Incremental Data at Scale
 
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/MLBig Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
Big Data Bellevue Meetup | Enhancing Python Data Loading in the Cloud for AI/ML
 
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
Alluxio Monthly Webinar | Why a Multi-Cloud Strategy Matters for Your AI Plat...
 
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...Alluxio Monthly Webinar | Five Disruptive Trends that Every  Data & AI Leader...
Alluxio Monthly Webinar | Five Disruptive Trends that Every Data & AI Leader...
 
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache EvictionData Infra Meetup | FIFO Queues are All You Need for Cache Eviction
Data Infra Meetup | FIFO Queues are All You Need for Cache Eviction
 
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio EdgeData Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
Data Infra Meetup | Accelerate Your Trino/Presto Queries - Gain the Alluxio Edge
 
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the CloudData Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
Data Infra Meetup | Accelerate Distributed PyTorch/Ray Workloads in the Cloud
 
Data Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet ReaderData Infra Meetup | ByteDance's Native Parquet Reader
Data Infra Meetup | ByteDance's Native Parquet Reader
 
Data Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage EvolutionData Infra Meetup | Uber's Data Storage Evolution
Data Infra Meetup | Uber's Data Storage Evolution
 
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
Alluxio Monthly Webinar | Why NFS/NAS on Object Storage May Not Solve Your AI...
 
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
AI Infra Day | Accelerate Your Model Training and Serving with Distributed Ca...
 
AI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI EraAI Infra Day | The AI Infra in the Generative AI Era
AI Infra Day | The AI Infra in the Generative AI Era
 

Recently uploaded

Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
Kari Kakkonen
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
SOFTTECHHUB
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
UiPathCommunity
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
Pierluigi Pugliese
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
Peter Spielvogel
 

Recently uploaded (20)

Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Climate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing DaysClimate Impact of Software Testing at Nordic Testing Days
Climate Impact of Software Testing at Nordic Testing Days
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
Why You Should Replace Windows 11 with Nitrux Linux 3.5.0 for enhanced perfor...
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
Le nuove frontiere dell'AI nell'RPA con UiPath Autopilot™
 
A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024By Design, not by Accident - Agile Venture Bolzano 2024
By Design, not by Accident - Agile Venture Bolzano 2024
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdfSAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
SAP Sapphire 2024 - ASUG301 building better apps with SAP Fiori.pdf
 

Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016

  • 1. Gianmario Spacagna 14th September, 2019 - Alluxio Meetup @ San Francisco, CA
  • 2. Takeaways ¨  What a logical data warehouse is ¨  How to handle governance issues ¨  An Agile workflow made of iterative exploratory analysis and production-quality development ¨  A fully in-memory stack for fast computation on top of Spark and Alluxio ¨  How to successfully do data science if your data resides in a RDBMS and you don’t have a data lake
  • 3. About me ¨  Engineering background in Distributed Systems ¤  (University of Cassino, Polytechnic of Turin, KTH of Stockholm) ¨  Data-relevant experience ¤  Predictive Marketing (AgilOne, StreamSend) ¤  Cyber Security (Cisco) ¤  Financial Services (Barclays) ¤  Automotive (Pirelli) ç
  • 4. Areas of interest ¨  Functional Programming, Scala and Apache Spark ¨  Contributor of the Professional Data Science Manifesto ¨  Founder of Data Science Milan Meetup community (datasciencemilan.org) ¨  Co-authoring Python Deep Learning book, coming soon… Building production-ready and scalable machine learning systems (continue with list of principles...)
  • 5. Data Science Agile cycle Get access to data Explore TransformTrain Evaluate Analyze results Even dozens of iterations per day!!!
  • 6. Successful development of new data products requires proper infrastructure and tools
  • 7. Start by building a toy model with a small snapshot of data that can fit in your laptop memory and eventually ask your organization for cluster resources
  • 8. ¨  You can’t solve problems with data science if data is not largely available ¨  Data processing should be fast and reactive to allow quick iterations ¨  The core team cannot depend on IT folks Start by building a toy model with a small snapshot of data that can fit in your laptop memory and eventually ask your organization for cluster resources
  • 9. Data Lake in a legacy enterprise environment
  • 10. Technical issues ¨  Engineering effort ¤  dedicated infrastructure team (expensive) ¨  Synchronization with new data from source ¤  Report what portion of data has been exported and what not ¨  Consistency / Data Versioning / Duplication ¤  ETL logic and requirements change very often ¤  Memory is cheap but when you have hundreds of sparse copies of same data is confusing ¨  I/O cost ¤  Reading/writing is expensive for iterative and explorative jobs (machine learning)
  • 11. Logical Data Warehouse ¨  View and access cleaned versions of data ¨  Always show latest version by default ¨  Apply transformations on-the-fly (discovery-oriented analytics) ¨  Abstract data representation from rigid structures of the DB’s persistence store ¨  Simply add new data sources using virtualization ¨  Flexible, fast time-to-market, lower costs
  • 12. What about governance issues? ¨  Large corporations can’t move data before an approved governance plan ¨  Data can only be stored in a safe environment administered by only a few authorized people who don’t necessary understand data scientists needs ¨  Data leakage paranoia, cloud-phobia! ¨  As result, data cannot be easily/quickly pulled from the central data warehouse and stored into an external infrastructure
  • 13. Long time and large investment for setting up a new project That’s not Agile!
  • 14. Wait a moment, analysts don’t seem to have this problem…
  • 15. From disk to volatile memory Distribute and make data temporary available in- memory in an ad-hoc development cluster
  • 16. ¨  In-memory engine for distributed data processing ¨  JDBC drivers to connect to relational databases ¨  Structured data represented using DataFrame API ¨  Fully-functional data manipulation via RDD API ¨  Machine learning libraries (ML/MLllib) ¨  Interaction and visualization through Spark Notebook or Zeppelin
  • 18. Just Spark cache is not enough ¨  Data is dropped from memory at each context restart due to ¤  Update dependency jar (common for mixed IDE development / notebook analysis) ¤  Re-submit the job execution ¤  Kerberos ticket expires L ¨  Fetching 600M rows can take ~ 1 hour in a 5 nodes cluster Dozens iterations per day => spending most of the time waiting for data to reload at each iteration!
  • 19. Distribute and make data temporary persistently available in-memory in the development cluster and shared among multiple concurrent applications From volatile memory to persistent memory storage
  • 20. ¨  Formerly known as Tachyon ¨  In-memory distributed storage system ¨  Long-term caching of raw data and intermediate results ¨  Spark can read/write in Alluxio seamlessly instead of using HDFS ¨  1-tier configuration safely leaves no traces to disk ¨  Data is loaded once and available for the whole development period to multiple applications
  • 21. Alluxio as the Key Enabling Technology
  • 22.
  • 23. 1-tier configuration ¨  ALLUXIO_RAM_FOLDER=/dev/shm/ramdisk ¨  alluxio.worker.memory.size=24GB ¨  alluxio.worker.tieredstore ¤  levels=1 ¤  level0.alias=MEM ¤  level0.dirs.path=${ALLUXIO_RAM_FOLDER} ¤  level0.dirs.quota=24G ¨  We leave empty the under FS configuration ¨  Deploy without mount (no root access required) ¤  ./bin/alluxio-start.sh all NoMount
  • 24. Spark read/write APIs ¨  DataFrame ¤  dataframe.write.save(”alluxio://master_ip:port/mydata/ mydataframe.parquet") ¤  val dataframe: DataFrame = sqlContext.read.load(”alluxio:// master_ip:port/mydata/mydataframe.parquet") ¨  RDD ¤  rdd.saveAsObjectFile(”alluxio://master_ip:port/mydata/myrdd.object") ¤  val rdd: RDD[MyCaseClass] = sc.objectFile[MyCaseClass] (”alluxio:// master_ip:port/mydata/myrdd.object")
  • 25. Making the impossible possible ¨  Agile workflow combining Spark, Scala, DataFrame, JDBC, Parquet, Kryo and Alluxio to create a scalable, in-memory, reactive stack to explore data directly from source and develop production-quality machine learning pipelines ¨  Data available since day 1 and at every iteration ¤  Alluxio decreased loading time from hours to seconds ¨  Avoid complicated and time-consuming Data Plumbing operations
  • 26. Further developments 1.  Memory size limitation ¤  Add external in-memory tiers? 2.  Set-up overhead ¤  JDBC drivers, partitioning strategy and data frame from/to case class conversion (Spark 2 aims to solve this) 3.  Shared memory resources between Spark and Alluxio ¤  Set Alluxio as OFF_HEAP memory as well and divide memory in storage and cache 4.  In-Memory replication for read availability ¤  If an Alluxio node fails, data is lost due the absence of an underlying file system 5.  Would be nice if Alluxio could handle this and mount a relational table/view in the form of data files (csv, parquet…)
  • 27. Follow-up links ¨  Original article on DZone: ¤  dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark- from-Hours-to-Seconds-With-Tachyon ¨  Professional Data Science Manifesto: ¤  datasciencemanifesto.org ¨  Vademecum of Practical Data Science: ¤  datasciencevademecum.wordpress.com ¨  Sparkz ¤  github.com/gm-spacagna/sparkz