
Weld Strata talk


Data analytics applications are often 10x off peak hardware performance because they combine multiple functions from different libraries and frameworks to build increasingly complex workflows. Even if each individual function is optimized in isolation, the cost of data movement across these functions can cause order-of-magnitude slowdowns. For example, even though the TensorFlow machine-learning library uses highly tuned linear algebra functions for each of its operators, workflows that combine these operators can be 16x slower than hand-tuned code. Similarly, workflows that perform relational processing in Spark SQL or pandas, numerical processing in NumPy, or a combination of these tasks spend most of their time in data movement across processing functions and could run between 2x and 30x faster if optimized end to end.

This talk offers an overview of Weld, an optimizing runtime for data-intensive applications that works across disjoint libraries and functions. Weld uses a common representation to capture the structure of diverse data-parallel workloads such as SQL, machine learning, and graph analytics and then optimizes across them using a cost-based optimizer that takes into account hardware characteristics. Weld can be integrated into a variety of widely used analytics frameworks, such as Spark SQL for relational processing, TensorFlow for machine learning, and Pandas and NumPy for general data science workloads. Integrating Weld with these frameworks requires no changes to user application code. Weld speeds up existing workloads in these frameworks by up to 16x and can also enable speed-ups of two orders of magnitude in applications that combine them.

Published in: Data & Analytics

  1. Weld: A Common Runtime for Data Analytics. Shoumik Palkar, James Thomas, Deepak Narayanan, Anil Shanbhag*, Rahul Palamuttam, Holger Pirk*, Malte Schwarzkopf*, Saman Amarasinghe*, Sam Madden*, Matei Zaharia. Stanford InfoLab, *MIT CSAIL
  2. Motivation: Modern data apps combine many disjoint processing libraries & functions » SQL, statistics, machine learning, … » e.g., the PyData stack
  3. Motivation: Modern data apps combine many disjoint processing libraries & functions » SQL, statistics, machine learning, … » e.g., the PyData stack. + Great results leveraging the work of 1000s of authors. – No optimization across these functions.
  4. How Bad Is This Problem? The growing gap between memory and processing speed makes the traditional way of combining functions worse.
     data = pandas.parse_csv(string)
     filtered = pandas.dropna(data)
     avg = numpy.mean(filtered)
     Up to 30x slowdowns in NumPy, Pandas, TensorFlow, etc. compared to an optimized C implementation.
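The slowdown comes from each library call materializing its full result before the next call begins, so the data crosses memory once per function. A minimal pure-Python sketch of this pattern (the function bodies below are hypothetical stand-ins, not real Pandas/NumPy internals):

```python
# Hypothetical sketch: each "library" call makes its own full pass over the
# data and materializes an intermediate result, mimicking the
# parse_csv -> dropna -> mean pipeline on the slide.
def parse_csv(text):
    # Pass 1: materialize a list of floats (None for missing fields).
    rows = []
    for line in text.strip().split("\n"):
        field = line.strip()
        rows.append(float(field) if field else None)
    return rows

def dropna(rows):
    # Pass 2: materialize a second list without missing values.
    return [x for x in rows if x is not None]

def mean(rows):
    # Pass 3: one more traversal to reduce to a single value.
    return sum(rows) / len(rows)

data = "1.0\n\n2.0\n3.0\n"
avg = mean(dropna(parse_csv(data)))  # three passes, two intermediate lists
```

An end-to-end optimizer could fuse these three traversals into one pass with no intermediate allocations, which is the opportunity Weld targets.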
  5. Optimizing Across Frameworks: a common runtime sits between workloads (machine learning, SQL, graph algorithms, …) and hardware (CPU, GPU, …).
  6. Optimizing Across Frameworks: the Weld runtime fills this role with a Runtime API, the Weld IR, an optimizer, and backends for CPU, GPU, …
  7. Integrations: partial prototypes of Pandas, NumPy, TensorFlow, and Apache Spark.
  8. Existing Libraries + Weld = 30x. TPC-H Q1 and Q6 (Spark SQL vs. Weld): 6.8x faster. Vector sum (NumPy vs. NumExpr vs. Weld): 4.5x faster. Logistic regression (TensorFlow vs. Weld, 1 and 12 threads): 30x faster.
  9. Cross libraries + Weld = 290x. Pandas + NumPy (native vs. Weld without CLO vs. Weld with CLO vs. Weld on 12 cores): 31x faster with CLO, 290x faster on 12 cores. Spark SQL UDF (Scala UDF vs. Weld): 14x faster. CLO = cross-library optimization.
  10. Integration Effort: small up-front cost to enable Weld integration » 500 LoC. Easy to port over each operator » 30 LoC each. Incrementally deployable » Weld-enabled ops work with native ops.
  11. Implementation: around 12K lines of Rust » fast and safe! Perfect for embedding. Parallel CPU backend using LLVM. GPU support coming soon.
  12. Weld is Open Source! weld.stanford.edu
  13. Rest of This Talk: the Runtime API (enabling cross-library optimization), the Weld IR (enabling speedups and automatic parallelization), and Grizzly (using Weld to build a faster Pandas).
  14. Runtime API: uses lazy evaluation to collect work across libraries.
      data = lib1.f1()
      lib2.map(data, item => lib3.f2(item))
      The user application passes lazy values between libraries; each function contributes an IR fragment (f1, map, f2) through the Runtime API, and the Weld runtime combines the fragments into a single IR program that is compiled to optimized machine code.
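The lazy-evaluation pattern behind this API can be sketched in a few lines of Python. Everything here (the LazyOp class, its methods, and the string stand-ins for IR) is invented for illustration; Weld's actual Runtime API differs:

```python
# Hypothetical sketch of a lazy-evaluation runtime API: library calls return
# unevaluated expression nodes that carry IR fragments; nothing runs until
# evaluate(), when the combined program executes once.
class LazyOp:
    def __init__(self, ir, op):
        self.ir = ir      # combined IR so far (a string stand-in)
        self.op = op      # zero-argument callable standing in for compiled code

    def map(self, f, ir):
        # Compose without executing: append an IR fragment, defer the work.
        return LazyOp(self.ir + " |> " + ir,
                      lambda: [f(x) for x in self.op()])

    def evaluate(self):
        # In Weld, the combined IR would be optimized and JIT-compiled here;
        # this sketch just runs the composed closures.
        return self.op()

# Two "libraries" hand each other lazy values instead of materialized data.
data = LazyOp("read", lambda: [1, 2, 3])
result = data.map(lambda x: x * x, "map(square)")
print(result.ir)          # read |> map(square)
print(result.evaluate())  # [1, 4, 9]
```

Because no operator runs until `evaluate()`, the runtime sees the whole pipeline at once and can optimize across library boundaries.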
  15. Weld IR: designed to meet three goals: 1. generality: support diverse workloads and nested calls; 2. ability to express optimizations: e.g., loop fusion, vectorization, loop tiling; 3. explicit parallelism and targeting parallel hardware.
  16. Weld IR: Internals. Small IR with only two main constructs. Parallel loops: iterate over a dataset. Builders: declarative objects for producing results » e.g., append items to a list, compute a sum » can be implemented differently on different hardware. Captures relational algebra, functional APIs like Spark, linear algebra, and composition thereof.
  17. Examples: implement functional operators using builders.
      def map(data, f):
          builder = new vecbuilder[int]
          for x in data:
              merge(builder, f(x))
          result(builder)
      def reduce(data, zero, func):
          builder = new merger[zero, func]
          for x in data:
              merge(builder, x)
          result(builder)
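The builder pseudocode above can be mimicked in plain Python. VecBuilder, Merger, weld_map, and weld_reduce below are hypothetical stand-ins for the Weld IR constructs on the slide, not a real Weld API:

```python
# Hypothetical Python analogues of the two builder types: a vecbuilder
# collects merged items into a list, a merger folds them with a binary
# function starting from an identity value.
class VecBuilder:
    def __init__(self):
        self.items = []
    def merge(self, x):
        self.items.append(x)
    def result(self):
        return self.items

class Merger:
    def __init__(self, zero, func):
        self.value, self.func = zero, func
    def merge(self, x):
        self.value = self.func(self.value, x)
    def result(self):
        return self.value

def weld_map(data, f):
    # map = loop + vecbuilder, as in the slide's pseudocode
    builder = VecBuilder()
    for x in data:
        builder.merge(f(x))
    return builder.result()

def weld_reduce(data, zero, func):
    # reduce = loop + merger
    builder = Merger(zero, func)
    for x in data:
        builder.merge(x)
    return builder.result()
```

The point of the declarative builder interface is that `merge` commits to *what* is accumulated, not *how*: a backend may shard a builder per thread or per GPU block and combine partial results at `result()`.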
  18. Example Optimization: Fusion.
      squares = map(data, x => x * x)
      sum = reduce(data, 0, +)
      becomes
      bld1 = new vecbuilder[int]
      bld2 = new merger[0, +]
      for x in data:
          merge(bld1, x * x)
          merge(bld2, x)
      Loops can be merged into one pass over the data.
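As a sketch of what this rewrite buys, here is the same computation in plain Python, once as two separate traversals and once as the fused single pass; the function names are invented for illustration:

```python
# Hypothetical before/after of the loop-fusion rewrite on the slide.
def squares_and_sum_unfused(data):
    squares = [x * x for x in data]   # pass 1 over data
    total = sum(data)                 # pass 2 over data
    return squares, total

def squares_and_sum_fused(data):
    squares, total = [], 0
    for x in data:                    # single pass: both builders merge here
        squares.append(x * x)         # merge into bld1 (vecbuilder)
        total += x                    # merge into bld2 (merger)
    return squares, total

assert squares_and_sum_fused([1, 2, 3]) == squares_and_sum_unfused([1, 2, 3])
```

The results are identical; the fused version simply reads `data` from memory once instead of twice, which is where the data-movement savings come from.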
  19. “Typically lumber along at 1.2-1.8 MPH” / “Active, agile, hard to out-run, top speed of 35 MPH”
  20. Grizzly: a subset of Pandas integrated with Weld » operators ported over include unique, filter, mask » easy to port more! Same API as native Pandas, so zero changes to applications. Transparent single-core and multi-core speedups.
  21. Grizzly in action:
      import pandas as pd
      # Read dataframe from file
      requests = pd.read_csv('filename.csv')
      # Fix requests with extra digits
      requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)
      # Fix requests with 00000 zipcodes
      zero_zips = requests['Incident Zip'] == '00000'
      requests['Incident Zip'][zero_zips] = np.nan
      # Display unique incident zips
      print requests['Incident Zip'].unique()
      Adapted from http://pandas.pydata.org/pandas-docs/stable/tutorials.html (chapter 7)
  22. Grizzly in action:
      import pandas as pd, grizzly as gr
      # Read dataframe from file
      raw_requests = pd.read_csv('filename.txt')
      requests = gr.DataFrame(raw_requests)
      # Fix requests with extra digits
      requests['Incident Zip'] = requests['Incident Zip'].str.slice(0, 5)
      # Fix requests with 00000 zipcodes
      zero_zips = requests['Incident Zip'] == '00000'
      requests['Incident Zip'][zero_zips] = np.nan
      # Display unique incident zips
      print requests['Incident Zip'].unique().evaluate()
      Adapted from http://pandas.pydata.org/pandas-docs/stable/tutorials.html (chapter 7)
  23. Demo
  24. Conclusion: Changing the interface between libraries can speed up data analytics applications by 10-100x on modern hardware. Try out Weld for yourself: weld.stanford.edu
