BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

High-level description of a use case from one hospital department, and a comparison of two solutions: 1) a big data solution using Cloudera Impala; and 2) a traditional RDBMS solution using Oracle DB.

Published in: Technology

  1. EndoMine System
     Jewish General Hospital
     by David Lauzon and Anton Zakharov
     Big Data Montreal #9
     February 5th 2013
  2. Presentation
     • Our Objectives
     • Requirements and context
     • Project scope
     • Hadoop Solution
       – Big Data Solution Overview
       – Hive Table Schema
       – Compression Performance
       – Data Architecture in Hadoop
       – Hadoop/Impala Prototype Demo
     • Oracle Solution
     • Hadoop vs Oracle comparison
     • What are expensive queries?
  3. Our Objectives
     • Lead an end-of-study project in an industrial context
       – Requirements elicitation
       – Implement a « proof-of-concept » prototype
     • Experiment with big data technologies
       – Compare with RDBMS
  4. Requirements and context
     • Department of Medical Diagnostic (medical test results DB, e.g. blood, urine, ...)
       – Dr. Shaun Eintracht
         • « ad hoc » Query
         • ETL Query
       – Dr. Elizabeth Mac Namara
         • « business intelligence » requirements
         • Realtime Dashboard
     • Department of Endocrinology
       – Dr. Mark Trifiro
         • Data mining
  5. Project scope
     • First iteration = improve ad-hoc queries
       – Slow analytical queries and ETL (MS Access)
       – Risk of « crashing » production DB
       – Some queries impossible to process
  6. Production DB (Oracle)
  7. Solutions
     • Solution 1: Hadoop + Impala
     • Solution 2: Tune the existing Oracle RDBMS
  8. Big Data Solution Overview
  9. Hive Table Schema
  10. Compression Performance
      [Chart: Impala, Hive and Oracle compared across storage formats (Oracle FS, Text File, Sequence File, SeqFile + Gzip, SeqFile + Snappy); vertical axis from 0 to 250, units not given in the transcript]
  11. Data Architecture in Hadoop
      • All big tables are pre-joined
        – With specimen (1)
        – Without specimen (2)
      • Partitioned using two schemes
        – Year-month (3)
        – Year and Test (4)
      • 4 different versions of the same data:
        – stay_order_results_yearmonth
        – stay_order_results_year_and_test
        – stay_order_results_specimen_yearmonth
        – stay_order_results_specimen_year_and_test
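As a rough illustration of the two partitioning schemes on slide 11, the sketch below builds the directory a row would land in under Hive's `key=value` partition layout. Only the table names come from the slides; the partition column names (`year`, `month`, `test`), the sample date, the test code `GLU` and the helper `partition_path` are assumptions made for illustration, since the transcript does not include the actual table definitions.

```python
from datetime import date


# Hypothetical helper: the partition column names are assumptions,
# only the table names are taken from the slides.
def partition_path(table: str, result_date: date, test_code: str) -> str:
    """Return a Hive-style partition directory for one result row."""
    if table.endswith("_yearmonth"):
        # Scheme (3): one partition per calendar month.
        return f"{table}/year={result_date.year}/month={result_date.month}"
    if table.endswith("_year_and_test"):
        # Scheme (4): one partition per year and test type.
        return f"{table}/year={result_date.year}/test={test_code}"
    raise ValueError(f"unknown partitioning scheme for table {table!r}")


row_date = date(2012, 5, 17)  # made-up sample row
by_month = partition_path("stay_order_results_yearmonth", row_date, "GLU")
by_test = partition_path("stay_order_results_year_and_test", row_date, "GLU")
print(by_month)
print(by_test)
```

Under these assumptions, a query filtered on a single month only has to read one directory of the year-month tables, while a query following one test over a whole year is better served by the year-and-test tables. That trade-off is a plausible reason for keeping several versions of the same data.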
  12. Hadoop Prototype Demo
  13. Oracle Solution
      • Same tables as source DB
        – A big pre-joined table is not a good solution
      • Techniques explored:
        – Partitioning
          • Partitions automatically created
        – Compression
          • Inefficient for joins
        – Clustering
        – Join multiple partitioned tables
  14. Oracle Solution (continued)
      • Avoid too many indexes on the big tables:
        – They take a lot of memory
        – They are slow to create
        – They may not be used if a query touches more than 5% of the rows
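The 5% figure on slide 14 is a rule of thumb about query selectivity. A minimal sketch of that rule follows; the function name and the row counts are hypothetical, and a real cost-based optimizer weighs many more factors than this single ratio.

```python
def index_probably_useful(matching_rows: int, total_rows: int,
                          threshold: float = 0.05) -> bool:
    """Apply the slide's rule of thumb: an index tends to help only when
    the query touches at most `threshold` (5% by default) of the rows."""
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = matching_rows / total_rows
    return selectivity <= threshold


# Made-up row counts for illustration.
narrow_lookup = index_probably_useful(1_000, 1_000_000)   # 0.1% of the rows
broad_report = index_probably_useful(100_000, 1_000_000)  # 10% of the rows
print(narrow_lookup, broad_report)
```

A lookup of one patient's results falls well under the threshold, whereas an analytical report over a large share of the table would likely be answered by a full scan, which is why extra indexes on the big tables may cost memory and build time without paying off.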
  15. Comparison: Hadoop Solution
      • Pros
        – Crunches massive amounts of data
        – Scalability
        – Free software
      • Cons
        – Needs better UI and tune-ups
        – Maintenance cost
        – Requires ETL time to merge data into one table
        – Big joins should be avoided
  16. Comparison: Oracle Solution
      • Pros
        – Just need to create a slave DB (just?)
        – Faster random lookups
        – Easier to find expertise
      • Cons
        – Scales only up to a certain point
        – Synchronisation with master DB:
          • Rebuilding indexes would take hours
  17. What are expensive queries?
      • If possible, avoid these constructs on large result sets
        – SELECT DISTINCT
        – ORDER BY
        – GROUP BY
        – JOIN big table with another big table
          • JOIN big table with multiple small tables should be OK
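One way to see why joining a big table with small tables is usually acceptable: the small table can be held in memory as a lookup map while the big table is streamed once. The sketch below shows that generic hash-join idea. The table contents, column layout and function name are made up for illustration, and it is not meant to describe how Impala or Oracle actually execute a join.

```python
from typing import Dict, Iterable, Iterator, Tuple

ResultRow = Tuple[int, str, float]  # (patient_id, test_code, value), hypothetical layout


def join_with_small_table(big_rows: Iterable[ResultRow],
                          small_lookup: Dict[str, str]) -> Iterator[Tuple[int, str, float]]:
    """Inner-join a streamed big table against an in-memory small table."""
    for patient_id, test_code, value in big_rows:
        test_name = small_lookup.get(test_code)  # constant-time probe
        if test_name is not None:                # drop rows with no match
            yield (patient_id, test_name, value)


# Small dimension table: fits comfortably in memory.
tests = {"GLU": "Glucose", "TSH": "Thyroid-stimulating hormone"}
# Big fact table: here a short list, in practice a stream of rows.
results = [(1, "GLU", 5.4), (2, "TSH", 2.1), (3, "XXX", 0.0)]

joined = list(join_with_small_table(results, tests))
print(joined)
```

When both tables are big, neither side fits in memory as a lookup map, so the join has to sort, spill to disk or shuffle data between nodes. DISTINCT, ORDER BY and GROUP BY on large result sets are costly for a similar reason: they need to see all rows before producing output.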
  18. Conclusion
      • Recommendation to use a “classic” RDBMS
        – The database fits on a single node
        – Existing expertise in-house
        – Acceptable performance with appropriate tune-ups
        – Stop using MS Access
      • Disadvantage: limited scalability
