
Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing


eBay has been using an enterprise ADBMS for over a decade, and in 2018 our team began migrating its batch workloads from the ADBMS to Spark. We gathered many experiences and lessons during the migration journey (85% automated, 15% manual migration), during which we exposed many unexpected issues and gaps between ADBMS and Spark SQL. In practice we made many decisions to close those gaps and contributed many fixes to Spark core in order to unblock ourselves. This sharing should be helpful to many folks, especially data and software engineers, planning and executing their own migration work. During this session we will walk through many of the specific issues we encountered and how we resolved or worked around each of them with the team in the real migration process.



  1. 1. WiFi SSID: SparkAISummit | Password: UnifiedAnalytics
  2. 2. Keith Sun Data Engineer, Data Service & Solution (eBay) Deep Dive of ADBMS Migration to Apache Spark—Use Cases Sharing #UnifiedAnalytics #SparkAISummit
  3. 3. About Us • DSS (Data Services & Solutions) team at eBay. • Focus on big data development, optimization, modeling & services on ADBMS, Spark/Hive, and Hadoop platforms. • Now spending more time on the migration from ADBMS to Spark.
  4. 4. Talks from our team • Experience Of Optimizing Spark SQL When Migrating from MPP Database, Yucai Yu & Yuming Wang, Spark Summit 2018, London. • Analytical DBMS to Apache Spark Auto Migration Framework, Edward Zhang, Spark Summit 2018, London. 4
  5. 5. Agenda 5 Background Use Cases and Best Practices Auto Migration Deep Dive
  7. 7. Spark as DW Processing Engine 7 Integrated Data Layer ZETA ODS Layer Metadata Knowledge Graph RT Data Service Batch Service Metadata Service DS (Data Science) DW (Data Warehouse) DI (Data Infrastructure) Compute/Storage Model
  8. 8. Spark Cluster Environment 8 1900 Nodes 460TB Memory Spark 2.1.0/2.3.1 Hadoop 2.7.1 Hive 1.2.1
  9. 9. Agenda 9 Background Use Cases and Best Practices Auto Migration Deep Dive
  10. 10. Migration Steps Overview 10 Table Schema Translation SQL Conversion Historical Data Copy SQL run on Yarn cluster Post Data Quality Check Logging and Error Parsing
  11. 11. Table Schema Translation 11 Single Partitioned Table Is Not Enough Column Name Is Case Sensitive Column Type Mapping Tips
  12. 12. Single Partitioned Table Is Not Enough Ø “Cannot overwrite a path that is also being read from,” regardless of different partitions. See SPARK-18107. Instead, create two tables: TableX & TableX_Merge. 12
  13. 13. Table DDL Sample 13 CREATE TABLE Table_X_Merge ( … dt string ) USING parquet OPTIONS ( path 'hdfs://hercules/table_x/snapshot/' ) PARTITIONED BY (dt) CREATE TABLE Table_X ( ….. ) USING parquet OPTIONS ( path 'hdfs://hercules/table_x/snapshot/dt=20190311' ) -- points to the latest partition
  14. 14. Column Name Is Case Sensitive Ø Lowercase column names for Hive/Spark Parquet file interoperation; otherwise you may see “NULL” fields, wrong results, or errors. (SPARK-25132) 14
  15. 15. Spark 2.1.0 throws an error: 15 “Caused by: java.lang.IllegalArgumentException: Column [id] was not found in schema!”
  16. 16. Spark 2.3.1 returns a wrong result silently. 16
  17. 17. Column Type Mapping Tips Ø Map decimal-typed integers to Integer so Parquet filter pushdown can accelerate the file scan. 17
  18. 18. Sample for Parquet filter pushdown to accelerate the file scan. (SPARK-24549) 18
  19. 19. Query Improvements – Predicate Pushdown [SPARK-25419] Parquet predicate pushdown improvement • [SPARK-23727] Support Date type • [SPARK-24549] Support Decimal type • [SPARK-24718] Support Timestamp type • [SPARK-24706] Support Byte type and Short type • [SPARK-24638] Support StringStartsWith predicate • [SPARK-17091] Support IN predicate
  20. 20. SQL Conversion 20 Update & Delete Conversion Insert Conversion Number Expression String Expression Recursive Query Conversion
  21. 21. SQL Conversion- Update/Delete Spark-SQL does not support update/delete yet. Transform update/delete statements into insert or insert overwrite. 21
  22. 22. ADBMS Use case 22 update tgt from database.tableX tgt, database.Delta ods set AUCT_END_DT = ods.AUCT_END_DT where tgt.LSTG_ID = ods.LSTG_ID; insert into database.tableX (LSTG_ID, AUCT_END_DT) select LSTG_ID, AUCT_END_DT from database.Delta ods left outer join database.tableX tgt on tgt.LSTG_ID = ods.LSTG_ID where tgt.LSTG_ID is null; (Diagram: yesterday's full data combined with the delta.)
  23. 23. Spark-SQL sample 23 insert overwrite table TableX_merge partition(dt='20190312') select coalesce(tgt.LSTG_ID, ods.LSTG_ID) as LSTG_ID, IF(ods.LSTG_ID is not null, ods.AUCT_END_DT, tgt.AUCT_END_DT) as AUCT_END_DT from TableX as tgt full outer join Delta ods on tgt.LSTG_ID = ods.LSTG_ID; alter table TableX set location 'xxxx/dt=20190312';
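The update-as-full-outer-join merge can be illustrated with a minimal Python sketch, assuming toy row data (dicts keyed by LSTG_ID stand in for the two tables):

```python
# Sketch of the merge: target is yesterday's full snapshot, delta is
# today's changed rows; both map LSTG_ID -> AUCT_END_DT (hypothetical values).
target = {1: "2019-03-01", 2: "2019-03-05"}
delta = {2: "2019-03-12", 3: "2019-03-15"}

# Full outer join on the key with the delta winning when a key exists in
# both -- exactly what IF(ods.LSTG_ID IS NOT NULL, ods..., tgt...) expresses.
merged = {**target, **delta}

print(merged)  # {1: '2019-03-01', 2: '2019-03-12', 3: '2019-03-15'}
```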
  24. 24. SQL Conversion- Insert Ø ADBMS implicitly dedupes data when inserting into a SET table (the default for new tables). For such cases, a “group by” or “distinct” is necessary. 24
  25. 25. ADBMS Use case (TableY is defined as a SET table) insert into TableY (LSTG_ID, AUCT_END_DT) select LSTG_ID, AUCT_END_DT from ods_tableY tgt 25
  26. 26. Spark-SQL sample 26 insert overwrite table TableY_merge partition(dt='20190312') select distinct * from ( select LSTG_ID, AUCT_END_DT FROM TableY tgt UNION ALL select LSTG_ID, AUCT_END_DT FROM ODS_TableY) tmp;
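The SET-table dedupe that the converted query reproduces (union all, then distinct) can be sketched in Python with hypothetical rows:

```python
# SET-table semantics: only distinct rows are kept on insert. A UNION ALL
# of both inputs followed by DISTINCT reproduces that (toy tuples of
# (LSTG_ID, AUCT_END_DT)).
table_y = [(1, "2019-03-01"), (2, "2019-03-05")]
ods_table_y = [(2, "2019-03-05"), (3, "2019-03-12")]

# dict.fromkeys keeps the first occurrence and preserves order, acting
# like DISTINCT over the concatenated (UNION ALL) inputs.
merged = list(dict.fromkeys(table_y + ods_table_y))
print(merged)  # [(1, '2019-03-01'), (2, '2019-03-05'), (3, '2019-03-12')]
```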
  27. 27. SQL Conversion – Number Expression Ø Rounding behavior ADBMS rounds with the “HALF_EVEN” rule by default, but Spark-SQL uses “HALF_UP”. 27
  28. 28. ADBMS Sample select cast(2.5 as decimal(4,0)) as result; -- returns 2 select cast(3.5 as decimal(4,0)) as result; -- returns 4 28
  29. 29. Spark-SQL Result spark-sql> select cast(2.5 as decimal(4,0)); -- returns 3 spark-sql> select bround(2.5,0) as col1; -- returns 2 29
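Both rounding behaviors can be reproduced with Python's decimal module, which makes the difference easy to test locally:

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

# ADBMS default: banker's rounding (HALF_EVEN) -- ties go to the even digit.
assert Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_HALF_EVEN) == 2
assert Decimal("3.5").quantize(Decimal("1"), rounding=ROUND_HALF_EVEN) == 4

# Spark SQL's cast: HALF_UP -- ties round away from zero.
assert Decimal("2.5").quantize(Decimal("1"), rounding=ROUND_HALF_UP) == 3
```

Spark's `bround` function gives HALF_EVEN behavior, which is why the slide uses it to match the ADBMS result.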
  30. 30. SQL Conversion – Number Expression Ø Number division result ADBMS returns an Integer for Integer division, while Spark always returns a double. Explicitly cast the division result to integer in Spark SQL. 30
  31. 31. Number division sample ADBMS: select 3/4 as col1; -- returns 0 31 Spark-SQL: spark-sql> select 3/4; -- returns 0.75 spark-sql> select cast(3/4 as int); -- returns 0
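Python 3's division operators happen to mirror the two behaviors, which makes a handy mental model:

```python
# Spark SQL's `/` behaves like Python 3's true division: always floating point.
print(3 / 4)       # 0.75

# The ADBMS integer-division result needs an explicit cast in Spark SQL;
# int() here plays the role of CAST(... AS INT), truncating toward zero.
print(int(3 / 4))  # 0
```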
  32. 32. SQL Conversion- String Expression Ø Case sensitivity in comparison/group by ADBMS is case insensitive in comparisons, while Spark-SQL is case sensitive. Apply the lower/upper function to string columns before comparison/group by. 32
  33. 33. ADBMS Use case 33 tableA: (col1 = 'abc', col2 = 100); tableB: (col1 = 'Abc', col2 = 100). Select a.col1, b.col2 from tableA a inner join tableB b on a.col1 = b.col1 → returns the match (a.col1 = 'abc', b.col1 = 'Abc'), because ADBMS comparison is case insensitive.
  34. 34. Spark-SQL Sample Select a.col1, b.col2 from tableA a inner join tableB b on a.col1 = b.col1 → no result. Select a.col1, b.col2 from tableA a inner join tableB b on lower(a.col1) = lower(b.col1) → returns the match (a.col1 = 'abc', b.col1 = 'Abc'). 34
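The same case-sensitivity gap can be sketched in Python with the hypothetical single-row tables from the slides:

```python
# Tuples of (col1, col2), matching the toy tables in the slides.
table_a = [("abc", 100)]
table_b = [("Abc", 100)]

# Direct equality (Spark SQL semantics) finds no match...
direct = [(a, b) for a, _ in table_a for b, _ in table_b if a == b]
# ...while folding case first (ADBMS-style comparison) does.
folded = [(a, b) for a, _ in table_a for b, _ in table_b
          if a.lower() == b.lower()]

print(direct)  # []
print(folded)  # [('abc', 'Abc')]
```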
  35. 35. SQL Conversion- String Expression Ø ADBMS auto-trims trailing spaces. A trim() function has to be applied to columns of “CHAR” type in Spark-SQL. 35
  36. 36. ADBMS Use Case 36 tableA: (col1 = 'Abc ' with trailing space, CHAR type; col2 = 100); tableB: (col1 = 'Abc', col2 = 100). Select a.col1, b.col2 from tableA a inner join tableB b on a.col1 = b.col1 → returns the match, because ADBMS auto-trims trailing spaces.
  37. 37. Spark-SQL Sample Select a.col1, b.col2 from tableA a inner join tableB b on a.col1 = b.col1 → no result. Select a.col1, b.col2 from tableA a inner join tableB b on trim(a.col1) = b.col1 → returns the match. 37
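The same effect, sketched in Python with a hypothetical padded CHAR value:

```python
# CHAR columns arrive with trailing padding; Spark compares them verbatim.
padded = "Abc   "  # hypothetical CHAR(6) value with pad spaces
exact = "Abc"

print(padded == exact)           # False -- Spark SQL equality on raw values
print(padded.rstrip() == exact)  # True  -- after trimming trailing spaces,
                                 #          as ADBMS does implicitly
```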
  38. 38. SQL Conversion- Other cases Ø Character Encoding issue. Ø Lower/Upper string with locale sensitivity Ø Decimal precision issue.[SPARK-22036] Ø "distribute by" on multiple columns may lead to codeGen issue.[SPARK-25084] Ø Datasource partition table should load empty static partitions[SPARK-24937] …. 38
  39. 39. Recursive Query Conversion Ø Spark-SQL does not support recursive queries yet (SPARK-24497). Ø We can implement them with the Spark DataFrame API. 39
  40. 40. Recursive query use case 40 with recursive employee_managers as ( select employee_no, manager_no from employees union all select a.employee_no, b.manager_no from employee_managers a join employees b on a.manager_no = b.employee_no ) select * from employee_managers;
  42. 42. Implementation – Key design Ø Pre-sort & bucket the reused table. Ø Write out the RDD data in each iteration. 42
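The iterative expansion behind the DataFrame implementation can be sketched in plain Python (toy employee data; the real implementation joins DataFrames and writes out each iteration's RDD as the key-design slide notes):

```python
# employee -> direct manager (None = top of the hierarchy); hypothetical data.
employees = {"e1": "m1", "m1": "m2", "m2": None}

# Seed the result with the direct (employee, manager) pairs, then repeatedly
# join the newest pairs back against the base table, one management level up,
# until no new pairs appear -- the fixpoint the recursive CTE computes.
closure = {(e, m) for e, m in employees.items() if m is not None}
frontier = set(closure)
while frontier:
    step = {(e, employees[m]) for e, m in frontier
            if employees.get(m) is not None}
    frontier = step - closure
    closure |= frontier

print(sorted(closure))  # [('e1', 'm1'), ('e1', 'm2'), ('m1', 'm2')]
```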
  43. 43. Dataframe API performance 43 (Chart: recursive query execution runtime in hours, MPP DB vs Spark-SQL.)
  44. 44. Do Not Repeat Yourself! • Can we make our life easier given all the above pitfalls and best practices? • The DRY principle – we need AUTOMATION! 44
  45. 45. Agenda 45 Background Use Cases and Best Practices Auto Migration Deep Dive
  46. 46. Automation Scope 46 • ~5K target tables • ~20K intermediate/working tables • ~22PB of target table data • ~40PB of relational data processed every day
  47. 47. Automation Workflow 47
  48. 48. Automation Framework 48
  49. 49. Automation Key components • Metadata Component • DDL Generator • SQL Convertor 49
  50. 50. Metadata 50 • Parse ADBMS transformation SQL to build the table dependency tree • Parse the origin EDW server config file • Classify table types into staging / working / target / view • Get table column definitions from DBC
  51. 51. Metadata 51
  52. 52. DDL Generator • A tool to generate the necessary tables' DDL on Spark based on Metadata • Define table type and schema -- bucket/partition • Create a data source for Spark SQL to adapt to multiple extract files • ADBMS SQL type vs Spark SQL type mapping 52
  53. 53. DDL Generator 53 ADBMS Table Model Spark Table Model Wrk.staging_a Wrk.working_a Tgt.target_a Wrk.staging_a Wrk.working_a Wrk.working_a_snpht Wrk.target_fin_w Tgt.target_a Tgt.target_a_merge
  54. 54. DDL Generator Sample 54#UnifiedAnalytics #SparkAISummit
  55. 55. SQL Convertor – Architecture 55
  56. 56. SQL Converter - ANTLR 56 • ANTLR -- ANother Tool for Language Recognition • Custom ANTLR Lexer/Parser to recognize MPP SQL
  57. 57. SQL Converter - ANTLR 57 • ANTLR -- ANother Tool for Language Recognition
  58. 58. SQL Converter – Rule Engine • Identify the SQL query pattern first, then do the conversion based on conversion rules. • Convert a single update/delete/insert into one insert-overwrite step. • Multiple update/delete cases – store intermediate step results in temp views, then do a final single merge. • Identify the column default value and table type (e.g., SET table for dedupe). • Convert functions based on the mapping. • Bridge gaps like the case sensitivity issue and date/time expressions. 58
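As a toy illustration of pattern-based conversion rules (not the actual eBay converter, which recognizes the SQL with a custom ANTLR lexer/parser), a rule-engine step that classifies a statement before rewriting it might look like:

```python
import re

# Hypothetical rule: detect a single-table "update ... from ..." statement,
# which the real converter rewrites into one insert-overwrite merge step.
UPDATE_RULE = re.compile(r"^\s*update\s+(\w+)\s+from\s+([\w.]+)", re.IGNORECASE)

def classify(sql: str) -> str:
    """Return the conversion rule a statement falls under (illustrative only)."""
    if UPDATE_RULE.match(sql):
        return "rewrite-as-insert-overwrite"
    return "pass-through"

print(classify("update tgt from database.tableX tgt, database.Delta ods set ..."))
# rewrite-as-insert-overwrite
```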
  59. 59. SQL Convertor Examples 59 ADBMS SQL
  60. 60. SQL Convertor Examples 60 ADBMS SQL
  61. 61. SQL Convertor Examples 61
  62. 62. Be Part of the Community: >100 issues reported to the community during migration, and we are still working with the community.
      Case-insensitive field resolution • SPARK-25132 Case-insensitive field resolution when reading from Parquet • SPARK-25175 Field resolution should fail if there's ambiguity for ORC native reader • SPARK-25207 Case-insensitive field resolution for filter pushdown when reading Parquet
      Parquet filter pushdown • SPARK-23727 Support DATE predicate push down in parquet • SPARK-24716 Refactor ParquetFilters • SPARK-24706 Support ByteType and ShortType pushdown to parquet • SPARK-24549 Support DecimalType push down to the parquet data sources • SPARK-24718 Timestamp support pushdown to parquet data source • SPARK-24638 StringStartsWith support push down • SPARK-17091 Convert IN predicate to equivalent Parquet filter
      UDF improvement • SPARK-23900 format_number udf should take user specified format as argument • SPARK-23903 Add support for date extract • SPARK-23905 Add UDF weekday
      Bugs • SPARK-24076 Very bad performance when shuffle.partition = 8192 • SPARK-24556 ReusedExchange should rewrite output partitioning also when child's partitioning is RangePartitioning • SPARK-25084 "distribute by" on multiple columns may lead to codegen issue • SPARK-25368 Incorrect constraint inference returns wrong result
      Enhancement • [SPARK-26004][SQL] InMemoryTable support StartsWith predicate push down • [SPARK-24570][SQL] Implement Spark own GetTablesOperation • [SPARK-24196][SQL] Implement Spark's own GetSchemasOperation • [SPARK-25269][SQL] SQL interface support specify StorageLevel when cache table
      Hive version upgrading • [SPARK-23710][SQL] Upgrade the built-in Hive to 2.3.4 for hadoop-3.2 62
  63. 63. THANKS!