Gianmario Spacagna
14th September, 2016 - Alluxio Meetup @ San Francisco, CA
Accelerating Machine Learning Pipelines with Alluxio at Alluxio Meetup 2016
Presented by Gianmario Spacagna, Pirelli Tyre
Alluxio Meetup at Samsung
http://www.meetup.com/Alluxio/


  1. Gianmario Spacagna
     14th September, 2016 - Alluxio Meetup @ San Francisco, CA
  2. Takeaways
     - What a logical data warehouse is
     - How to handle governance issues
     - An Agile workflow made of iterative exploratory analysis and production-quality development
     - A fully in-memory stack for fast computation on top of Spark and Alluxio
     - How to successfully do data science if your data resides in an RDBMS and you don't have a data lake
  3. About me
     - Engineering background in Distributed Systems (University of Cassino, Polytechnic of Turin, KTH Stockholm)
     - Data-relevant experience:
       - Predictive Marketing (AgilOne, StreamSend)
       - Cyber Security (Cisco)
       - Financial Services (Barclays)
       - Automotive (Pirelli)
  4. Areas of interest
     - Functional Programming, Scala and Apache Spark
     - Contributor to the Professional Data Science Manifesto
     - Founder of the Data Science Milan Meetup community (datasciencemilan.org)
     - Co-authoring the Python Deep Learning book, coming soon…
     - Building production-ready and scalable machine learning systems
  5. Data Science Agile cycle
     Get access to data → Explore → Transform → Train → Evaluate → Analyze results → …
     Even dozens of iterations per day!!!
  6. Successful development of new data products requires proper infrastructure and tools
  7. Start by building a toy model with a small snapshot of data that can fit in your laptop memory, and eventually ask your organization for cluster resources
  8. Start by building a toy model with a small snapshot of data that can fit in your laptop memory, and eventually ask your organization for cluster resources
     - You can't solve problems with data science if data is not readily available
     - Data processing should be fast and reactive to allow quick iterations
     - The core team cannot depend on IT folks
  9. Data Lake in a legacy enterprise environment
  10. Technical issues
      - Engineering effort: requires a dedicated infrastructure team (expensive)
      - Synchronization with new data from the source: must track which portion of the data has been exported and which has not
      - Consistency / data versioning / duplication: ETL logic and requirements change very often; memory is cheap, but hundreds of sparse copies of the same data are confusing
      - I/O cost: reading/writing is expensive for iterative and exploratory jobs (machine learning)
  11. Logical Data Warehouse
      - View and access cleaned versions of data
      - Always show the latest version by default
      - Apply transformations on-the-fly (discovery-oriented analytics)
      - Abstract the data representation from the rigid structures of the DB's persistence store
      - Simply add new data sources using virtualization
      - Flexible, fast time-to-market, lower costs
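In Spark terms, such a logical view can be sketched as a lazily evaluated DataFrame exposed as a virtual table. A minimal sketch, assuming a hypothetical `rawOrders` frame read straight from the source and made-up column names:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical: rawOrders is a DataFrame read directly from the source
// (e.g. over JDBC). Nothing below is materialized until a query runs,
// so the "view" always reflects the latest data by default.
def ordersClean(rawOrders: DataFrame): DataFrame =
  rawOrders
    .filter(col("status").isNotNull)                       // cleaning rule
    .withColumn("amount_eur", col("amount_cents") / 100.0) // on-the-fly transform

// Expose it like a virtual table for discovery-oriented analytics
// (Spark 1.x API, as used elsewhere in this deck):
// ordersClean(rawOrders).registerTempTable("orders_clean")
// sqlContext.sql("SELECT count(*) FROM orders_clean")
```

The point is that the cleaning logic lives in code, not in a materialized copy, so changing it does not create yet another duplicate of the data.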
  12. What about governance issues?
      - Large corporations can't move data before an approved governance plan is in place
      - Data can only be stored in a safe environment administered by a few authorized people, who don't necessarily understand data scientists' needs
      - Data-leakage paranoia, cloud-phobia!
      - As a result, data cannot be easily or quickly pulled from the central data warehouse and stored in an external infrastructure
  13. Long time and large investment for setting up a new project. That's not Agile!
  14. Wait a moment, analysts don't seem to have this problem…
  15. From disk to volatile memory: distribute and make data temporarily available in-memory in an ad-hoc development cluster
  16. Apache Spark
      - In-memory engine for distributed data processing
      - JDBC drivers to connect to relational databases
      - Structured data represented using the DataFrame API
      - Fully functional data manipulation via the RDD API
      - Machine learning libraries (ML/MLlib)
      - Interaction and visualization through Spark Notebook or Zeppelin
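As a concrete sketch of the JDBC entry point above, this is roughly how a relational table becomes a DataFrame in the Spark 1.x API used throughout this deck. The connection URL, driver, table, and column names are hypothetical placeholders:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("jdbc-explore"))
val sqlContext = new SQLContext(sc)

// Hypothetical connection details: adjust driver, URL and credentials.
val customers = sqlContext.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://dbhost:5432/warehouse")
  .option("dbtable", "customers")
  .option("driver", "org.postgresql.Driver")
  // Range-partition on a numeric column so executors read in parallel.
  .option("partitionColumn", "customer_id")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "16")
  .load()

customers.printSchema()
```

Without the partitioning options, the whole table would be pulled through a single connection, which is exactly the slow path this stack is trying to avoid.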
  17. In-memory workflow
  18. Just Spark cache is not enough
      - Data is dropped from memory at each context restart, caused by:
        - updating a dependency jar (common for mixed IDE development / notebook analysis)
        - re-submitting the job execution
        - Kerberos ticket expiry
      - Fetching 600M rows can take ~1 hour on a 5-node cluster
      Dozens of iterations per day => most of the time is spent waiting for data to reload at each iteration!
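The limitation in code form: `cache()` pins data only within one SparkContext, so every restart repays the full JDBC fetch. A sketch; `jdbcUrl` and the table name are placeholders:

```scala
import org.apache.spark.sql.SQLContext

// Inside one Spark application (sqlContext as in the rest of the deck):
def loadBigTable(sqlContext: SQLContext, jdbcUrl: String) = {
  val rows = sqlContext.read
    .format("jdbc")
    .option("url", jdbcUrl)          // placeholder connection string
    .option("dbtable", "big_table")  // ~600M rows in the slide's example
    .load()

  rows.cache()   // materialized on the first action (~1 hour on 5 nodes)
  rows.count()   // subsequent actions hit the in-memory copy
  rows
}

// But the cached blocks live inside this SparkContext only. Updating a
// dependency jar, re-submitting the job, or an expired Kerberos ticket
// restarts the context and silently drops them: the next action pays
// the full JDBC reload again. This is the gap Alluxio fills by keeping
// the data in memory across context restarts.
```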
  19. From volatile memory to persistent memory storage: distribute and make data persistently available in-memory in the development cluster, shared among multiple concurrent applications
  20. Alluxio
      - Formerly known as Tachyon
      - In-memory distributed storage system
      - Long-term caching of raw data and intermediate results
      - Spark can read/write to Alluxio seamlessly instead of using HDFS
      - The 1-tier configuration safely leaves no traces on disk
      - Data is loaded once and stays available to multiple applications for the whole development period
  21. Alluxio as the Key Enabling Technology
  22. 1-tier configuration
      - ALLUXIO_RAM_FOLDER=/dev/shm/ramdisk
      - alluxio.worker.memory.size=24GB
      - alluxio.worker.tieredstore.levels=1
      - alluxio.worker.tieredstore.level0.alias=MEM
      - alluxio.worker.tieredstore.level0.dirs.path=${ALLUXIO_RAM_FOLDER}
      - alluxio.worker.tieredstore.level0.dirs.quota=24G
      - We leave the under-FS configuration empty
      - Deploy without mount (no root access required):
        ./bin/alluxio-start.sh all NoMount
  23. Spark read/write APIs
      - DataFrame
        dataframe.write.save("alluxio://master_ip:port/mydata/mydataframe.parquet")
        val dataframe: DataFrame = sqlContext.read.load("alluxio://master_ip:port/mydata/mydataframe.parquet")
      - RDD
        rdd.saveAsObjectFile("alluxio://master_ip:port/mydata/myrdd.object")
        val rdd: RDD[MyCaseClass] = sc.objectFile[MyCaseClass]("alluxio://master_ip:port/mydata/myrdd.object")
  24. Making the impossible possible
      - An Agile workflow combining Spark, Scala, DataFrames, JDBC, Parquet, Kryo and Alluxio creates a scalable, in-memory, reactive stack to explore data directly from the source and develop production-quality machine learning pipelines
      - Data is available from day 1 and at every iteration: Alluxio decreased loading time from hours to seconds
      - Avoids complicated and time-consuming data-plumbing operations
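Kryo, listed in the stack above, is switched on through Spark configuration. A minimal sketch, assuming a stand-in case class for your own domain types:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Stand-in domain type; replace with your own case classes.
case class Reading(sensorId: Long, value: Double)

val conf = new SparkConf()
  .setAppName("in-memory-stack")
  // Use Kryo instead of Java serialization for cached and shuffled data.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write compact IDs instead of full class names.
  .registerKryoClasses(Array(classOf[Reading]))

val sc = new SparkContext(conf)
```

Compact Kryo records shrink what has to be held in the limited in-memory tier, which matters when the whole working set lives in RAM.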
  25. Further developments
      1. Memory size limitation: add external in-memory tiers?
      2. Set-up overhead: JDBC drivers, partitioning strategy and DataFrame from/to case class conversion (Spark 2 aims to solve this)
      3. Shared memory resources between Spark and Alluxio: set Alluxio as OFF_HEAP memory as well, and divide memory into storage and cache
      4. In-memory replication for read availability: if an Alluxio node fails, data is lost due to the absence of an underlying file system
      5. It would be nice if Alluxio could handle this and mount a relational table/view in the form of data files (CSV, Parquet…)
  26. Follow-up links
      - Original article on DZone: dzone.com/articles/Accelerate-In-Memory-Processing-with-Spark-from-Hours-to-Seconds-With-Tachyon
      - Professional Data Science Manifesto: datasciencemanifesto.org
      - Vademecum of Practical Data Science: datasciencevademecum.wordpress.com
      - Sparkz: github.com/gm-spacagna/sparkz
