
AIDC NY: BODO AI Presentation - 09.19.2019


Presentation provided by one of Intel's AI Builder partners.


  1. Bodo Inc. Confidential – Do not copy. Simplify Data Science at Scale, Automatically. The Fastest Compute Cluster Engine for Data Science. www.bodo-inc.com
  2. Problem: Advanced Analytics is Complex and Slow. Segregated environments/code rewrites; complexity & redundancies; performance & scale; quality of insights; unproductive data science. 60% "fail beyond pilot"; 78% "struggle to analyze high volume of data"; 10% "of data used"; 64% "of time spent on data prep". (Gartner report 2017, TechRepublic 2018 survey)
  3. Segregated Development & Production Environments (e.g. a typical enterprise environment). Development: small sample data, local files. Production: rewrite the code for performance, scalability & reliability; test & validate against the original code; different sizes & sources of data used; custom infrastructure. The result: cumbersome processes, different expertise, high cost.
  4. Unified Development & Production Environments: a single code base, a single data lake. Scaling Python AI to HPC. Stack: language, middleware engine & connectors, data lake/warehouse. Benefits: instant deployment to production; automatic scalability; performance & power optimization; built-in high-performance connectors. A single, unified code repository with version and access control (e.g. GitHub); a single data lake for all with access control & security. Users: data scientists, DevOps/IT engineers, data frameworks, other teams.
  5. Bodo Core Engine. Automatic Parallelization (open source, initially based on Intel HPAT research): translates a basic subset of Python into fast machine code at run time. Analytics Engine (proprietary + open source): data structures/APIs for manipulating numerical data, tables, and time series. Integrated HPC Architecture: takes advantage of various HPC technologies (Intel MPI, Intel TBB) for efficient execution on hardware resources. Auto deployment: pandas operations such as df['latency'].sum() and df['latency'].mean() inside a @bodo.jit function compile down to optimized, vectorized binaries (AVX instructions such as vaddsd and vblendvpd).
  6. Example Program:
     @bodo.jit
     def example():
         table = pd.read_parquet('data.parquet')
         data = table[table['A'].str.contains('ABC*', regex=True)]
         stats = data['B'].describe()
         print(stats)
     Computes mean, std, min, max, 25/50/75% quantiles, count. Simple: no change to the code. More capability, e.g. quantile calculation. HPC scaling: 115X speedup on 4 nodes. Power optimized: >10X better power. (100M samples, 2-socket Intel(R) Xeon(R) Platinum 8180 nodes)
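The slide's program can be followed in plain pandas on a tiny synthetic frame (the data below is made up for illustration; with Bodo, the same pandas code would simply be wrapped in @bodo.jit):

```python
import pandas as pd

# Plain-pandas sketch of the slide's example on made-up data; Bodo would
# run identical pandas code unchanged, just decorated with @bodo.jit.
table = pd.DataFrame({
    "A": ["ABC1", "xyz", "ABC2", "q"],
    "B": [1.0, 2.0, 3.0, 4.0],
})
data = table[table["A"].str.contains("ABC*", regex=True)]  # regex filter
stats = data["B"].describe()  # count, mean, std, min, quartiles, max
print(stats)
```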
  7. Very Large Exchange. Challenge — Problem: exchange order books need to be monitored in real time to ensure market safety and fairness; the existing infrastructure is complex and not fast enough. Requirements: high performance and efficiency, portability to enable a multi-cloud strategy, simple-to-manage infrastructure. Use case: real-time analysis of order books for surveillance and fraud detection. Bodo Solution: 35X speedup on 56 cores over Python (similar speedup vs. Spark); portable across an existing virtualized Hadoop cluster and all cloud platforms. Transaction analytics performance: Python (1 core) 64.9 s; Bodo (1 core) 26.6 s (2.5x); Bodo (56 cores) 1.84 s (35x). Value proposition: high performance, portable and simple infrastructure. "Helps us use true representation of data instead of statistical representation."
  8. Very Large Bank. Challenge — Problem: a complex risk model in Python developed in short cycles; the code can run on just 1% of the data in real time; a rewrite by IT is not practical (time, maintenance, etc.). Requirements: minimal code changes, integration into the existing framework, simple infrastructure. Trials of Spark, Dask and SAS were unsuccessful. Use case: real-time risk analysis of credit card portfolios. Bodo Solution: integrated with minimal code changes; 45X improvement in execution time over Python; unified development and deployment environments. Risk model evaluation: Python 54.58 s; Bodo 28.6/14.44/7.51/4.01/2.28/1.49/1.22 s on 1/2/4/8/16/32/56 cores. Value propositions: 1) unify environments, 2) capture the full value of the dataset. "We want to use this technology for many of our applications."
  9. Bodo on Intel Platform. Optimized for Intel Xeon Scalable Processors: code optimizations (e.g. memory access improvements); automatic vectorization (e.g. Intel AVX-512); efficient multi-core scaling; efficient multi-node scaling (using Intel MPI); integration with Intel libraries (e.g. daal4py). Scalability & efficiency suited for Intel's networking and storage products: Intel Omni-Path Architecture, Intel Optane.
  10. Demo 1: Trading Strategy Back-testing. Intraday mean-reversion back-testing on historical market data; time-series operations: rolling window, shift; efficient scaling across large market data sets.
  11. Trading Strategy Back-testing Example:
      @bodo.jit(locals={'s_open': bodo.float64[:], …})
      def intraday_mean_revert():
          f = h5py.File("stock_data.hdf5", "r")
          …
          for i in prange(nsyms):
              symbol = sym_list[i]
              s_open = f[symbol + '/Open'][:]
              …
              df = pd.DataFrame({'Open': s_open, …})
              df['Stdev'] = df['Close'].rolling(window=90).std()
              df['Moving Average'] = df['Close'].rolling(window=20).mean()
              df['Criteria1'] = (df['Open'] - df['Low'].shift(1)) < -df['Stdev']
              df['Criteria2'] = df['Open'] > df['Moving Average']
              df['BUY'] = df['Criteria1'] & df['Criteria2']
              df['Pct Change'] = (df['Close'] - df['Open']) / df['Open']
              df['Rets'] = df['Pct Change'][df['BUY'] == True]
              n_days = len(df['Rets'])
              res = np.zeros(max_num_days)
              if n_days:
                  res[-n_days:] = df['Rets'].fillna(0)
              all_res += res
      Highlights: loop-level parallelism over stock symbols; extra type annotation for dynamic I/O; time-series operators (rolling window, shift); automatic distributed communication.
  12. Demo 2: TPCx-BB Q26. TPCx-BB: a standard data-centric benchmark for ML. Retail analytics dataset: customer info, store sale transactions, etc. Q26: cluster customers into groups based on their purchasing histories. End-to-end workload: data loading, preprocessing, feature engineering, ML algorithm. Representative of complex data pipelines in many domains.
  13. TPCx-BB Q26 Example:
      @bodo.jit
      def q26(ss_file, i_file, category, item_count):
          ss_dtype = {'ss_item_sk': np.int64, 'ss_customer_sk': np.int64}
          store_sales = pd.read_csv(ss_file, sep='|', usecols=[2, 3],
                                    names=ss_dtype.keys(), dtype=ss_dtype)
          …
          item2 = item[item['i_category'] == category]
          sale_items = pd.merge(store_sales, item2,
                                left_on='ss_item_sk', right_on='i_item_sk')
          count1 = sale_items.groupby('ss_customer_sk')['ss_item_sk'].count()
          gp1 = sale_items.groupby('ss_customer_sk')['i_class_id']
          def id1(x):
              return (x == 1).sum()
          …
          def id15(x):
              return (x == 15).sum()
          customer_i_class = gp1.agg((id1, …, id15))
          customer_i_class['ss_item_count'] = count1
          customer_i_class = customer_i_class[
              customer_i_class.ss_item_count > item_count]
          data = customer_i_class.values.astype(np.float64)
      Highlights: flexible I/O; large-scale join; groupby with UDF.
  14. Value Proposition. PRODUCTIVE DATA SCIENCE: focus on insights vs. data prep; access to the full dataset. ACCURATE INSIGHTS: more iteration; deeper analytics. COST EFFECTIVE: reduced infrastructure complexity; unified environments. MULTI-CLOUD STRATEGY SUPPORT: on-premise, public cloud, edge; workloads on any cloud combination. EFFICIENT COMPUTE CLUSTER: leverage existing infrastructure; maximize distributed architecture. HETEROGENEOUS INFRASTRUCTURE: the best application runs on the best cluster; edge performance on edge devices.
  15. Backup
  16. Workflow Comparisons: Bodo's parallel architecture achieves HPC efficiency. Spark path: Python code → rewrite → Spark API code → Spark Runtime → cluster/cloud, with a driver and executors 0 … N-1 running waves of tiny tasks. Bodo path: Python code → compile → parallel binary (MPI) → cluster/cloud, with ranks 0 … N-1 as few, efficiently running processes. (Totoni et al., "A Case Against Tiny Tasks in Iterative Analytics", HotOS'17)
  17. Performance Comparisons: Bodo achieves high productivity and HPC performance (Julia used; Python is similar). On Amazon AWS, 4 c4.8xlarge nodes (144 vCPUs), Bodo is 20x–256x faster than Spark — Kernel Density: 46.2 s (Spark) vs. 0.18 s (Bodo); Linear Regression: 102 s vs. 5.08 s; Logistic Regression: 64.1 s vs. 1.81 s; K-Means: 547 s vs. 2.91 s (MPI/C++ baselines: 0.08, 1.47, 1.09, 0.83 s). On Cori at NERSC/LBL, 64 nodes (2048 cores), Bodo is 370x–2000x faster than Spark — Kernel Density: 61 s vs. 0.03 s; Linear Regression: 767 s vs. 2.06 s; Logistic Regression: 830 s vs. 0.98 s; K-Means: 351 s vs. 0.95 s (MPI/C++ baselines: 0.013, 1, 0.5, 0.24 s). (Totoni et al., "HPAT: High Performance Analytics with Scripting Ease-of-Use", ICS'17)
  18. Products. Bodo-Core: streamline & unify data science. Bodo-Cloud: the fastest and most efficient compute cluster for data science. Plug-ins: productivity toolkit, full ETL/analytics, data ecosystem integration, workload management, enterprise security. Cloud platform: interactive notebooks, integration & connectors, image processing, visualization, stream processing.
  19. Value Proposition. PRODUCTIVE DATA SCIENCE: focus on insights vs. data prep; access to the full dataset. ACCURATE INSIGHTS: more iteration; deeper analytics. COST EFFECTIVE: reduced infrastructure complexity; unified environments. MULTI-CLOUD STRATEGY SUPPORT: on-premise, public cloud, edge; workloads on any cloud combination. EFFICIENT COMPUTE CLUSTER: leverage existing infrastructure; maximize distributed architecture. HETEROGENEOUS INFRASTRUCTURE: the best application runs on the best cluster; edge performance on edge devices.
  20. Technology Differentiators. ANALYTICS ENGINE: innovative compiler approach to scalable analytics; automatic parallelization technology; optimizing compiler pipeline for analytics. PARALLEL ARCHITECTURE: integrated HPC technology (MPI); ability to integrate with many optimized libraries; native parallelism support for all analytics patterns. NEW CAPABILITIES: handling user-defined functions natively; rolling-window computation (IoT apps); statistical functions with complex parallelism.
  21. Very Large Financial ISV. Challenge — Problem: a complex AI transaction-reconciliation application is developed and maintained by domain experts in Python; it needs to run in real time on client banks' servers as part of a software package without major modification; the data is too large and the app runs very slowly in the user environment. Requirements: minimal code changes, easy to deploy. Use case: embedded AI in a transaction-reconciliation application. Bodo Solution: scaled the existing code and met deployment requirements; deploys and runs seamlessly in the user environment; 360X string-processing performance improvement. String-processing UDF performance: Python (1 core) 97.5 s; Bodo (1 core) 13.13 s (7.4x); Bodo (1 node, 36 cores) 0.63 s (155x); Bodo (4 nodes, 144 cores) 0.27 s (361x). Value proposition: data-centric business application with AI deployed "at the edge". VP of Technology: "This is amazing… I'm speechless… It's much simpler than Spark also."
  22. Largest Networking Equipment Maker. Challenge — Problem: an AI application monitors data streams on an edge device for network security; the Python code developed by experts needs to run in real time on 30k streams, but cannot be rewritten due to maintenance requirements in segregated environments. Requirement: changing the code/data format is not viable due to segregated teams. Use case: real-time network security at the edge using AI. Bodo Solution: demonstrated viability of meeting the full requirements; required no code rewrite, with a 4X speedup at the edge; the customer delivered their network security appliance to their end user without compromising performance or cost. Value propositions: 1) real-time analytics at the edge, 2) avoiding code rewrite.
  23. Large US-Based Telco Carrier. Challenge — Problem: the R&D team runs batch Python code overnight, which slows development; real-time execution is critical for both the R&D and IT teams; pre-processing packet date/time fields for ML is the bottleneck. Requirement: easy to program for the research team, simple infrastructure to manage. Alternatives considered: Spark, OmniSci. Use case: analyze network logs & predict load using AI. Bodo Solution: makes overnight processing real time; accelerated with minimal code change, easy to install; up to 14,000X improvement. Data preprocessing performance: Python 640.1 s; Bodo (1 core) 6.44 s (99x); Bodo (1 node, 56 cores) 0.121 s (5300x); Bodo (4 nodes, 224 cores) 0.045 s (14200x). Value proposition: simplicity and flexibility in programming & scaling to real time. "I can't believe the speed; it's over 1000X improvement."
  24. Large Database Company and CSP. Challenge: a cloud-based BI software product extracts insights and "explains" the data to business users using AI. Problem: the complex Python backend code is not fast enough for real-time interactive usage; rewriting the code is not practical. Requirements: run in real time, integrate seamlessly, low cloud cost for many simultaneous users. No other solution had met the scaling objectives. Use case: automatic BI insights using real-time AI. Bodo Solution: accelerated a function with a few lines of code change; 5X acceleration of the entire product, reducing execution time for many simultaneous users from 162 s to 35 s; design win at Oracle, with work underway to further optimize. Value proposition: scaling complex Python analytics code to real time.
  25. Dataframe Primitives (Intel Labs). Bodo data primitives are 3.6x to 70x faster than Spark SQL — Filter/Join/Aggregate execution times: Python (pandas) 40.6/43.5/5.6 s, Spark SQL 0.88/7.34/111.8 s, HPAT 0.23/2.04/1.6 s. Bodo analytics primitives are 1300x to 20000x faster than Spark SQL — cumsum/SMA/WMA: Spark SQL 239/248/325 s vs. HiFrames 0.18/0.016/0.016 s. Measured on 4 dual-socket Intel Xeon E5-2699 v3 (144 total cores). (Totoni et al., "HiFrames: High Performance Data Frames in a Scripting Language", arXiv 2017)
  26. Bodo Limitation: type stability. Input code to Bodo should be statically compilable (type stable). Dynamically typed code examples (rare in analytics):
      Untypable variable:
      if flag1:
          a = 2
      else:
          a = np.ones(n)
      if isinstance(a, np.ndarray):
          doWork(a)
      Unresolvable function:
      if flag2:
          f = np.zeros
      else:
          f = np.ones
      b = f(m)
      Non-static dataframe schema:
      if flag2:
          df = pd.DataFrame({'A': [1, 2, 3]})
      else:
          df = pd.DataFrame({'A': ['a', 'b', 'c']})
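A common workaround for the first example (a general static-typing idiom, not a Bodo-specific API) is to make both branches produce the same type, e.g. lifting the scalar case to an array:

```python
import numpy as np

# Type-stable rewrite of the "untypable variable" example: both branches
# now yield an ndarray, so `a` has a single static type on every path.
def make_a(flag1, n):
    if flag1:
        a = np.full(n, 2.0)   # scalar 2 lifted to an array of 2.0
    else:
        a = np.ones(n)
    return a
```

With one type on every path, the isinstance check becomes unnecessary and the function is statically compilable.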
  27. Parallel File Read.
      import pandas as pd
      import bodo

      @bodo.jit
      def read_pq():
          df = pd.read_parquet('cycling_dataset.pq')
          ...
          return result
      A sequential/monolithic file is read in parallel as blocks (Block1 … Block4). Currently supports CSV, Parquet and HDF5. The block-parallel read parallelizes the following operations.
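The block partitioning behind such a parallel read can be sketched with a small helper (names and layout here are illustrative, not Bodo's internals): each of N ranks reads a contiguous, near-equal slice of the rows.

```python
def block_range(n_rows, n_ranks, rank):
    # 1D block distribution: the first (n_rows % n_ranks) ranks
    # get one extra row so the slices differ in size by at most one.
    base, extra = divmod(n_rows, n_ranks)
    start = rank * base + min(rank, extra)
    count = base + (1 if rank < extra else 0)
    return start, count

# 10 rows over 4 ranks -> row ranges (0,3), (3,3), (6,2), (8,2)
blocks = [block_range(10, 4, r) for r in range(4)]
```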
  28. Data Parallel Operations.
      import pandas as pd
      import bodo

      @bodo.jit
      def read_pq():
          df = pd.read_parquet('cycling_dataset.pq')
          df = df[df.power != 0]
          df['hr'] = df['hr'] * 2
          …
      Data-parallel operations (like filters and operations on individual rows) require no communication; each block (Block1 … Block4) is transformed locally.
  29. Parallel Reduction.
      @bodo.jit
      def read_pq():
          df = pd.read_parquet('cycling_dataset.pq')
          result = df.hr.mean()
          ...
      Reductions (like mean, avg, etc.) are transformed to efficient MPI code as known from HPC. Results from reductions get replicated on all processes.
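The shape of such a distributed mean can be emulated in plain Python (a stand-in for the MPI code, not Bodo's implementation): each "rank" reduces its block to a (sum, count) pair, and the pairs are combined the way an MPI_Allreduce would.

```python
# Data split across 3 "ranks" (made-up values for illustration)
blocks = [[1.0, 2.0], [3.0, 4.0, 5.0], [6.0]]

partials = [(sum(b), len(b)) for b in blocks]   # local reduction per rank
total, count = map(sum, zip(*partials))         # "allreduce" combine step
mean = total / count                            # replicated result: 3.5
```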
  30. Parallel Groupby + Aggregation.
      @bodo.jit
      def read_pq():
          df = pd.read_parquet('cycling_dataset.pq')
          ...
          grp = df.groupby('hour')
          mean = grp['power'].mean()
          ...
      Potentially more complex communication than simple reductions. The result will be block-distributed (potentially with variable block sizes).
  31. Machine Learning with daal4py.
      import bodo
      import daal4py as d4p
      import pandas as pd

      # get inertia for various numbers of clusters
      @bodo.jit
      def find_clusters():
          X = pd.read_parquet(…).values
          distortions = []
          for k in range(2, 20):
              kmi = d4p.kmeans_init(k)
              icenters = kmi.compute(X).centroids
              result = d4p.kmeans(k, 300).compute(X, icenters)
              distortions.append(result.goalFunction[0][0])
          return distortions
  32. Software Architecture. Bodo (distributed-memory parallelism, data I/O, data frames) builds on ParallelAccelerator (loop parallelism — NumPy and explicit — plus shared memory; now in Numba), which builds on Numba (compiles sequential Python/NumPy), which builds on LLVM (binary code generation). MPI provides the parallel runtime.
  33. Compilation Pipeline. Python function bytecode → Numba IR → dataframe transform (Numba IR with dataframe nodes) → series transform (Numba IR with series converted) → ParallelAccelerator passes (Numba IR with array ops optimized) → distributed analysis/transform (Numba IR with MPI calls) → efficient MPI binary.
  34. Dataframe Compilation. Key idea: transform dataframes to their underlying arrays. Simple case: a dataframe operation becomes series/array operations — df2 = df1.head() becomes df2_A = df1_A.head(), df2_B = df1_B.head(), … Harder cases: a dataframe operation becomes an IR node — df3 = pd.merge(df1, df2, …) becomes Join(keys, l_in_columns, …). Requirement: the schema of dataframes is known statically.
  35. Data Parallelism Extraction. D = A * B + C is first recognized as parallelism — parfor i=1:n t[i] = A[i]*B[i]; parfor i=1:n D[i] = t[i]+C[i] — and the loops are then fused: parfor i=1:n D[i] = A[i]*B[i]+C[i]. (Anderson et al., "Parallelizing Julia with a Non-invasive DSL", ECOOP'17; https://github.com/IntelLabs/parallelaccelerator.jl; https://github.com/numba/numba)
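The fused parfor computes D[i] = A[i]*B[i] + C[i] in a single pass; the same elementwise semantics can be checked with a plain NumPy expression (made-up values):

```python
import numpy as np

# Elementwise semantics that the compiler recognizes and fuses into one loop
A = np.arange(3.0)      # [0., 1., 2.]
B = np.full(3, 2.0)     # [2., 2., 2.]
C = np.ones(3)          # [1., 1., 1.]

D = A * B + C           # fused-loop result: [1., 3., 5.]
```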
  36. Automatic Parallelization Approach. Exploit domain properties: high-level Pandas/NumPy operations are implicitly parallel; map/reduce + relational parallel patterns; one-dimensional block distribution of data and compute; "big" collections are distributed, "small" collections are replicated. Data-flow compiler algorithm: transfer functions for operations; fixed-point iteration converges to an optimal solution. (Totoni et al., "HPAT: High Performance Analytics with Scripting Ease-of-Use", ICS'17)
  37. Automatic Distribution. Example: the samples matrix (rows a0 b0 c0 … a5 b5 c5) and the labels vector (y0 … y5) get a 1D block distribution across processes P0, P1, P2 — each rank owns a contiguous block of rows — while the small weights vector w (w0 w1 w2) is replicated on every rank.
  38. Automatic Distribution. For w = np.dot(A, b): A (rows a0 b0 c0 … a5 b5 c5) is 1D-block distributed, b (b0 b1 b2) is replicated, and the result w (w0 w1 w2) is replicated.
  39. Data Flow Framework. Distribution meet-semilattice: L = {1D_B, 2D_BC, REP}, ordered REP ≤ 2D_BC ≤ 1D_B, with ⊤ = 1D_B, ⊥ = REP, and meet operator ∧. A transfer function for each node type, based on high-level semantics: D_a: A → L for arrays, D_p: P → L for parfors. Overall program transfer function: (D_a, D_p) = F(D_a, D_p). Solve using fixed-point iteration; it converges with monotone transfer functions.
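A toy version of this fixed-point computation over the three-element lattice (illustrative only, not Bodo's implementation; the constraint list below is hypothetical and mirrors the np.dot example, where b and the dot-product output w must be replicated):

```python
# Lattice order REP <= 2D_BC <= 1D_B: the meet of two distributions
# is the smaller one in this order.
RANK = {"REP": 0, "2D_BC": 1, "1D_B": 2}

def meet(x, y):
    return x if RANK[x] <= RANK[y] else y

# Start every array at top (1D_B); apply transfer functions until no
# distribution changes (the fixed point).
dist = {"A": "1D_B", "b": "1D_B", "w": "1D_B"}
constraints = [("b", "REP"), ("w", "REP")]   # from dot-product semantics

changed = True
while changed:                                # fixed-point iteration
    changed = False
    for var, d in constraints:
        new = meet(dist[var], d)
        if new != dist[var]:
            dist[var], changed = new, True
# A stays block-distributed; b and w end up replicated
```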
  40. Fundamental Problem: "Tiny Tasks"¹. Break data-parallel cluster compute jobs into 100 ms tasks. Advantages of tiny tasks: fault tolerance, load balancing, job scheduling, … Disadvantages: scheduling, serialization, and communication overheads². We argue tiny tasks are not necessary³ — fault tolerance, etc. can be achieved differently. (¹Ousterhout et al., "The Case for Tiny Tasks in Compute Clusters", HotOS'13; ²McSherry et al., "Scalability! But at what COST?", HotOS'15; ³Totoni et al., "A Case Against Tiny Tasks in Iterative Analytics", HotOS'17)
  41. "Tiny Tasks" for Fault Tolerance. Tiny tasks enable frequent checkpoints in Spark, but the optimal checkpoint period is long in ML — thousands of seconds. Young–Daly formula: P = sqrt(2Cμ), where C is the checkpointing time and μ is the MTBF.
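Plugging in illustrative numbers (C and the MTBF below are assumptions, not from the slide) shows why the optimal period lands in the thousands of seconds:

```python
from math import sqrt

# Young–Daly optimal checkpoint period: P = sqrt(2 * C * mu)
C = 60.0           # seconds to write one checkpoint (assumed)
mu = 24 * 3600.0   # mean time between failures: 24 h (assumed)

P = sqrt(2 * C * mu)   # optimal period, roughly 3220 s
```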
  42. Communication Primitives. Shuffle using tiny tasks (e.g. Spark): the scheduler launches map tasks that write intermediate shuffle files, then launches reduce tasks that read those files — lots of overhead. All-to-all collective (e.g. MPI_Alltoall): the textbook algorithm runs n steps; at step i, process p sends to p+i and receives from p-i. Many optimized algorithms exist based on network topology and message sizes. No centralized scheduler, no task launch, no files.
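The textbook all-to-all schedule can be simulated in a few lines (a pure-Python illustration of the communication pattern, not MPI code): with n processes, after n steps every process has received a chunk from every peer exactly once.

```python
# At step i, process p sends its chunk for (p + i) % n and receives
# from (p - i) % n; record which senders each process hears from.
n = 4
received = {p: set() for p in range(n)}
for i in range(n):
    for p in range(n):
        dst = (p + i) % n
        received[dst].add(p)   # p's chunk for dst arrives at step i

# every process ends up with chunks from all n peers (including itself)
```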
  43. Map/Reduce Limitation for Analytics. Some analytics operations don't fit Spark's map/reduce pattern: moving averages require near-neighbor exchange; cumulative sum requires a prefix scan (MPI_Scan).
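The prefix-scan structure of a distributed cumulative sum can be emulated in plain Python (a stand-in for the MPI_Scan step, with made-up data): each rank scans its block locally, then adds the exclusive prefix of the preceding ranks' block totals.

```python
from itertools import accumulate

# Data split across 3 "ranks"
blocks = [[1, 2], [3, 4], [5]]

local_scans = [list(accumulate(b)) for b in blocks]   # per-rank scan
totals = [s[-1] for s in local_scans]                 # block sums
offsets = [0] + list(accumulate(totals))[:-1]         # exclusive scan (MPI_Scan)
result = [x + off for s, off in zip(local_scans, offsets) for x in s]
# result is the global cumulative sum [1, 3, 6, 10, 15]
```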
  44. Time Series Analytics. Time-series data is naturally produced by many sources (video, IoT, finance, …). Key underlying problem: handling parallel algorithms with fine-grained communication. Bodo maps high-level semantics to MPI asynchronous primitives. Example — 'window' functions require communication across data partitions:
      df.rolling('5min', on='time')['pid'].apply(
          lambda a: pd.Series(a).nunique())
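The slide's rolling-window snippet runs as-is in plain pandas on a small synthetic frame (column names follow the slide; the timestamps and pids below are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(
        ["12:00:00", "12:02:00", "12:04:00", "12:07:00", "12:08:00"]),
    "pid": [1, 2, 1, 3, 3],
})

# number of distinct pids in each trailing 5-minute window
uniq = df.rolling("5min", on="time")["pid"].apply(
    lambda a: pd.Series(a).nunique())
```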
  45. Time Series Analytics. Merging time series: e.g. the latest value of an IMU sensor for each front-camera image. Basic parallel algorithm for an 'asof' join: broadcast the left key boundaries of all processors (Allgather); find the interval overlap of local data with all other processors; shuffle the data (Alltoall). Better algorithms are possible.
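The single-node semantics that this parallel algorithm computes is exactly what pandas' merge_asof provides: each camera frame picks up the latest IMU reading at or before its timestamp (the data below is made up for illustration).

```python
import pandas as pd

frames = pd.DataFrame({"time": pd.to_datetime(["12:00:01", "12:00:05"])})
imu = pd.DataFrame({
    "time": pd.to_datetime(["12:00:00", "12:00:03", "12:00:04"]),
    "accel": [0.1, 0.2, 0.3],
})

# backward search (the default): latest imu row with time <= frame time
joined = pd.merge_asof(frames, imu, on="time")
```

Both inputs must be sorted on the join key, which is also a precondition the parallel boundary-broadcast step relies on.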