BDM9 - Comparison of Oracle RDBMS and Cloudera Impala for a hospital use case

High-level use case description of one department of a hospital, and a comparison of two solutions: 1) a big data solution using Cloudera Impala; and 2) a traditional RDBMS solution using Oracle DB.

Published in: Technology

Notes
  • Choose Shaun: smaller scale, immediate need, lets us test the technology
  • Database containing the test analysis data for patient specimens, along with the results. Running analytical queries on the production database is very slow and can interfere with normal operation with
  • WILL NOT COVER: requirements elicitation
  • 25% faster with Snappy compression (5.5X compression). Impala is 80% faster than Oracle

    1. EndoMine System
       Jewish General Hospital
       by David Lauzon and Anton Zakharov
       Big Data Montreal #9
       February 5th 2013
    2. Presentation
       • Our Objectives
       • Requirements and context
       • Project scope
       • Hadoop Solution
         – Big Data Solution Overview
         – Hive Table Schema
         – Compression Performance
         – Data Architecture in Hadoop
         – Hadoop/Impala Prototype Demo
       • Oracle Solution
       • Hadoop vs Oracle comparison
       • What are expensive queries?
    3. Our Objectives
       • Lead an end-of-study project in an industrial context
         – Requirements elicitation
         – Implement a "proof-of-concept" prototype
       • Experiment with big data technologies
         – Compare with RDBMS
    4. Requirements and context
       • Department of Medical Diagnostic (medical test results DB, e.g. blood, urine, ...)
         – Dr. Shaun Eintracht
           • "ad hoc" Query
           • ETL Query
         – Dr. Elizabeth Mac Namara
           • "business intelligence" requirements
           • Realtime Dashboard
       • Department of Endocrinology
         – Dr. Mark Trifiro
           • Data mining
    5. Project scope
       • First iteration = improve ad-hoc queries
         – Slow analytical queries and ETL (MS Access)
         – Risk of "crashing" production DB
         – Some queries impossible to process
    6. Production DB (Oracle)
    7. Solutions
       • Solution 1: Hadoop + Impala
       • Solution 2: Tune the existing Oracle RDBMS
    8. Big Data Solution Overview
    9. Hive Table Schema
    10. Compression Performance
       (Chart: Impala, Hive and Oracle compared for each storage format: Oracle FS, Text File, Sequence File, SeqFile + Gzip, SeqFile + Snappy; vertical axis from 0 to 250)
    11. Data Architecture in Hadoop
       • All big tables are pre-joined
         – With specimen (1)
         – Without specimen (2)
       • Partitioned using two schemes
         – Year-month (3)
         – Year and Test (4)
       • 4 different versions of the same data:
         – stay_order_results_yearmonth
         – stay_order_results_year_and_test
         – stay_order_results_specimen_yearmonth
         – stay_order_results_specimen_year_and_test
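The four table versions on this slide are the cross product of two choices: whether the specimen data is pre-joined, and which partitioning scheme is used. A minimal Python sketch of that naming, plus a hypothetical partition key per scheme, is given here. The key formats (`2013-02`, `2013/GLU`) and the test code `GLU` are assumptions for illustration only; the slides do not show the actual Hive partition columns.

```python
from datetime import date
from itertools import product

BASE = "stay_order_results"  # common table-name prefix taken from the slide


def table_names():
    """Build the four table names: (without/with specimen) x (partition scheme)."""
    names = []
    for specimen, scheme in product(("", "specimen"), ("yearmonth", "year_and_test")):
        parts = [BASE, specimen, scheme]
        # Skip the empty specimen part for the "without specimen" variants.
        names.append("_".join(p for p in parts if p))
    return names


def partition_key(scheme, result_date, test_code):
    """Return an illustrative partition key for one row (format is an assumption)."""
    if scheme == "yearmonth":
        return f"{result_date.year:04d}-{result_date.month:02d}"
    if scheme == "year_and_test":
        return f"{result_date.year:04d}/{test_code}"
    raise ValueError(f"unknown partition scheme: {scheme}")


if __name__ == "__main__":
    for name in table_names():
        print(name)
    print(partition_key("yearmonth", date(2013, 2, 5), "GLU"))
    print(partition_key("year_and_test", date(2013, 2, 5), "GLU"))
```

A query that filters on a date range would read only the matching year-month partitions, while a query about a single test would favour the year-and-test layout, which is presumably why the same data is stored under both schemes.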
    12. Hadoop Prototype Demo
    13. Oracle Solution
       • Same tables as source DB
         – A big pre-joined table is not a good solution
       • Techniques explored:
         – Partitioning
           • Partitions automatically created
         – Compression
           • Inefficient for joins
         – Clustering
         – Join multiple partitioned tables
    14. Oracle Solution (continued)
       • Avoid too many indexes on the big tables:
         – They take a lot of memory
         – They are slow to create
         – They may not be used if the query touches more than 5% of the rows
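The 5% figure on this slide can be read as a selectivity rule of thumb. The sketch below only encodes that rule as stated; the function name and the default threshold parameter are hypothetical, and it does not reproduce how Oracle's optimizer actually costs an index scan.

```python
def index_likely_used(matching_rows: int, total_rows: int, threshold: float = 0.05) -> bool:
    """Rule of thumb from the slide: an index helps only for selective queries.

    Returns True when the fraction of rows the query touches is at or below
    the threshold (5% by default), False otherwise.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    selectivity = matching_rows / total_rows
    return selectivity <= threshold


if __name__ == "__main__":
    # 1,000 of 1,000,000 rows (0.1%): selective, an index is plausible.
    print(index_likely_used(1_000, 1_000_000))
    # 100,000 of 1,000,000 rows (10%): a full scan is likely cheaper.
    print(index_likely_used(100_000, 1_000_000))
```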
    15. Comparison: Hadoop Solution
       • Pros
         – Crunches massive amounts of data
         – Scalability
         – Free software
       • Cons
         – Needs better UI and tune-ups
         – Maintenance cost
         – Requires ETL time to merge data into one table
         – BIG joins should be avoided
    16. Comparison: Oracle Solution
       • Pros
         – Just need to create a slave DB (just?)
         – Faster random lookup
         – Easier to find expertise
       • Cons
         – Scalability only up to a certain point
         – Synchronisation with master DB:
           • Rebuilding indexes would take hours
    17. What are expensive queries?
       • If possible, avoid these constructs on large result sets
         – SELECT DISTINCT
         – ORDER BY
         – GROUP BY
         – JOIN big table with another big table
           • JOIN big table with multiple small tables should be OK
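One way to see why joining a big table with small tables is usually acceptable: the small side can be held in memory as a lookup, so the big table is streamed once. The sketch below is a generic in-memory hash join in Python, not a description of how Impala or Oracle executed the queries in this project; the column names (`test_id`, `test_name`, `value`) are hypothetical.

```python
def join_with_lookup(big_rows, small_rows, key):
    """Inner-join a large row stream against a small table on `key`.

    The small table is loaded into a dict once; each big row then costs a
    single dictionary lookup, so the big side is never sorted or buffered.
    """
    lookup = {row[key]: row for row in small_rows}
    for row in big_rows:
        match = lookup.get(row[key])
        if match is not None:
            # Merge the lookup columns into the streamed row.
            yield {**match, **row}


if __name__ == "__main__":
    tests = [
        {"test_id": 1, "test_name": "glucose"},
        {"test_id": 2, "test_name": "sodium"},
    ]
    results = [
        {"test_id": 1, "value": 5.4},
        {"test_id": 2, "value": 140.0},
        {"test_id": 3, "value": 7.0},  # no matching test: dropped by the inner join
    ]
    for joined in join_with_lookup(results, tests, "test_id"):
        print(joined)
```

When both sides are large, neither fits in a single in-memory lookup, and the engine has to partition, spill, or shuffle data, which is where the cost on the slide comes from.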
    18. Conclusion
       • Recommendation to use a "classic" RDBMS
         – The database fits on a single node
         – Existing expertise in-house
         – Acceptable performance with appropriate tune-ups
         – Stop using MS Access
       • Disadvantage: limited scalability
