Integrated Data Warehouse with Hadoop and Oracle Database

Use cases and advice on integrating Hadoop with the enterprise data warehouse


  1. Building the Integrated Data Warehouse with Oracle Database and Hadoop (Gwen Shapira, Senior Consultant)
  2. Why Pythian
     • Recognized Leader: global industry leader in data infrastructure managed services and consulting, with expertise in Oracle, Oracle Applications, Microsoft SQL Server, MySQL, big data, and systems administration. Works with over 200 multinational companies such as Forbes.com, Fox Sports, Nordion, and Western Union to help manage their complex IT deployments.
     • Expertise: one of the world's largest concentrations of dedicated, full-time DBA expertise; employs 8 Oracle ACEs/ACE Directors and holds 7 specializations under the Oracle Platinum Partner program, including Oracle Exadata, Oracle GoldenGate, and Oracle RAC.
     • Global Reach & Scalability: 24/7/365 global remote support for DBA and consulting, systems administration, special projects, or emergency response.
     © 2012 – Pythian
  3. About Gwen Shapira
     • Oracle ACE Director
     • 13 years with a pager, 7 as an Oracle DBA
     • Senior Consultant: has MacBook, will travel
     • @gwenshap
     • http://www.pythian.com/news/author/shapira/
  4. Agenda
     • What is Big Data?
     • Why do we care about Big Data?
     • Why does your DWH need Hadoop?
     • Examples of Hadoop in the DWH
     • How to integrate Hadoop into your DWH
     • Avoiding major pitfalls
  5. What is Big Data?
  6. MORE DATA THAN YOU CAN HANDLE
  7. MORE DATA THAN RELATIONAL DATABASES CAN HANDLE
  8. MORE DATA THAN RELATIONAL DATABASES CAN HANDLE CHEAPLY
  9. Data arriving at fast rates
     Typically unstructured
     Stored without aggregation
     Analyzed in real time
     For reasonable cost
  10. Where does Big Data come from?
     • Social media
     • Enterprise transactional data
     • Consumer behaviour
     • Multimedia
     • Sensors and embedded devices
     • Network devices
  11. Why all the Excitement?
  12. Complex Data Architecture
  13. Your DWH needs Hadoop
  14. Big Problems with Big Data
     It is:
     • Unstructured
     • Unprocessed
     • Un-aggregated
     • Un-filtered
     • Repetitive
     • And generally messy. Oh, and there is a lot of it.
  15. Technical Challenges
     • Storage capacity, storage throughput, pipeline throughput → scalable storage
     • Processing power, parallel processing → massively parallel processing
     • System integration, data analysis → ready-to-use tools
  16. Hadoop Principles
     • Bring code to data
     • Share nothing
  17. Hadoop in a Nutshell
     • HDFS: replicated distributed big-data file system
     • Map-Reduce: framework for writing massively parallel jobs
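The Map-Reduce model on slide 17 can be sketched with ordinary Unix tools: the map phase emits one key per line, the shuffle sorts so that identical keys land next to each other, and the reduce phase aggregates each group. A minimal word-count sketch follows (the file names are illustrative, not from the deck); the commented `hadoop jar` line shows how the same pipeline would run on a cluster with Hadoop Streaming, with an assumed jar path.

```shell
# Create a small sample "log" to count words in (illustrative data).
printf 'GET /a\nGET /b\nGET /a\n' > access.log

# Map:     split input into one word per line (the map output key).
# Shuffle: sort brings identical keys together.
# Reduce:  uniq -c counts each group of identical keys.
tr -s '[:space:]' '\n' < access.log | sort | uniq -c | sort -rn > word_counts.txt

cat word_counts.txt

# The same logic on a cluster with Hadoop Streaming (jar path is an assumption):
# hadoop jar hadoop-streaming.jar \
#   -input /logs/access.log -output /out/wordcount \
#   -mapper "tr -s '[:space:]' '\n'" -reducer 'uniq -c'
```

The point of the sketch is the "bring code to data" principle above: the mapper and reducer are tiny programs shipped to wherever the blocks of the input file live.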
  18. Hadoop Benefits
     • Reliable solution based on unreliable hardware
     • Designed for large files
     • Load data first, structure later
     • Designed to maximize throughput of large scans
     • Designed to maximize parallelism
     • Designed to scale
     • Flexible development platform
     • Solution ecosystem
  19. Hadoop Limitations
     • Hadoop is scalable but not fast
     • Batteries not included
     • Instrumentation not included either
     • Well-known reliability limitations
  20. Hadoop in the Data Warehouse: Use Cases and Customer Stories
  21. ETL for Unstructured Data
     Logs from web servers, app servers, and clickstreams flow through Flume into Hadoop
     for cleanup, aggregation, and long-term storage, then into the DWH for BI and batch reports.
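The Flume leg of the pipeline above is driven by a properties file. Below is a minimal sketch of an agent that tails a web-server log and lands the events in HDFS; every name and path in it (agent1, the log location, the namenode URL, the date layout) is an assumption for illustration, not taken from the deck.

```shell
# Write an illustrative Flume NG agent definition; all names/paths are assumptions.
cat > flume.conf <<'EOF'
agent1.sources  = tail1
agent1.channels = mem1
agent1.sinks    = hdfs1

# Source: follow the web-server log as it grows.
agent1.sources.tail1.type = exec
agent1.sources.tail1.command = tail -F /var/log/httpd/access_log
agent1.sources.tail1.channels = mem1

# Channel: buffer events in memory between source and sink.
agent1.channels.mem1.type = memory

# Sink: land events in HDFS, partitioned by day.
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode:8020/logs/web/%Y-%m-%d
agent1.sinks.hdfs1.channel = mem1
EOF

# Start the agent (requires a Flume installation; shown for reference):
# flume-ng agent --conf-file flume.conf --name agent1
```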
  22. ETL for Structured Data
     OLTP data from Oracle, MySQL, Informix, and other sources is extracted with Sqoop or Perl into Hadoop
     for transformation, aggregation, and long-term storage, then loaded into the DWH for BI and batch reports.
  23. Bring the World into Your Datacenter
  24. Rare Historical Report
  25. Find Needle in Haystack
  26. We are not doing SQL anymore
  27. Connecting the (big) Dots
  28. Sqoop Queries
  29. Sqoop is Flexible (for import)
     • Select columns from a table with a WHERE condition, or write your own query
     • Split column
     • Parallel
     • Incremental
     • File formats
  30. Sqoop Import Examples
     sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
       --username hr --table emp \
       --where "start_date > '01-01-2012'"

     sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
       --username myuser --table shops \
       --split-by shop_id --num-mappers 16
     (The split-by column must be indexed or partitioned to avoid 16 full table scans.)
  31. Less Flexible Export
     • 100-row batch inserts
     • Commit every 100 batches
     • Parallel export
     • Update mode
     Example:
     sqoop export --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
       --table bar --export-dir /results/bar_data
  32. Fuse-DFS
     • Mount HDFS on the Oracle server:
       sudo yum install hadoop-0.20-fuse
       hadoop-fuse-dfs dfs://name_node_hostname:namenode_port mount_point
     • Use external tables to load data into Oracle
     • File formats may vary
     • All ETL best practices apply
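Once HDFS is mounted via Fuse-DFS, the "external tables" bullet above amounts to ordinary external-table DDL pointing at the mounted files. The script below only generates the DDL for review; the mount point, directory, table, and column names are all assumptions for illustration, and the script would be run in SQL*Plus against the warehouse database.

```shell
# Generate illustrative external-table DDL over a Fuse-DFS mount point.
# /mnt/hdfs and all object names are assumptions -- adjust to your environment.
cat > ext_clicks.sql <<'EOF'
CREATE OR REPLACE DIRECTORY hdfs_dir AS '/mnt/hdfs/user/etl/clicks';

CREATE TABLE clicks_ext (
  click_time VARCHAR2(30),
  user_id    NUMBER,
  url        VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY hdfs_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    FIELDS TERMINATED BY '\t'
  )
  LOCATION ('part-00000', 'part-00001')
);
EOF

# Review, then run and load with an ordinary INSERT ... SELECT from clicks_ext:
# sqlplus etl_user@masterdb @ext_clicks.sql
```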
  33. Oracle Loader for Hadoop
     • Load data from Hadoop into Oracle
     • Map-Reduce job inside Hadoop
     • Converts data types
     • Partitions and sorts
     • Direct path loads
     • Reduces CPU utilization on the database
  34. Oracle Direct Connector for HDFS
     • Create external tables over files in HDFS
     • PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
     • All the features of external tables
     • Tested (by Oracle) as 5 times faster (GB/s) than Fuse-DFS
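The PREPROCESSOR clause quoted on slide 34 is the key difference from the Fuse-DFS approach: the connector's hdfs_stream script streams the HDFS files to the access driver, so no filesystem mount is needed. A sketch of what such a table might look like follows; all object and location-file names are assumptions (in practice the connector's tooling generates the location files), and the generated script would be reviewed and run in SQL*Plus.

```shell
# Illustrative DDL for an external table read through the Direct Connector.
# All object names are assumptions; HDFS_BIN_PATH must point at the connector's bin directory.
cat > odch_clicks.sql <<'EOF'
CREATE TABLE clicks_hdfs (
  click_time VARCHAR2(30),
  user_id    NUMBER,
  url        VARCHAR2(4000)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY data_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE
    PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
    FIELDS TERMINATED BY ','
  )
  LOCATION ('clicks1.loc')
);
EOF

# sqlplus etl_user@masterdb @odch_clicks.sql   (run after reviewing)
```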
  35. Big Data Appliance and Exadata
  36. How not to Fail
  37. Data that belongs in RDBMS
  38. Prepare for Migration
  39. Use Hadoop Efficiently
     • Understand your bottlenecks: CPU, storage, or network?
     • Reduce use of temporary data: all data goes over the network and is written to disk in triplicate
     • Eliminate unbalanced workloads
     • Offload work to the RDBMS
     • Fine-tune optimization with Map-Reduce
  40. Your Data is NOT as BIG as you think
  41. Getting Started
     • Pick a business problem
     • Acquire data
     • Use the right tool for the job
     • Hadoop can start on the cheap
     • Integrate the systems
     • Analyze data
     • Get operational
  42. Thank You and Q&A
     To contact us: sales@pythian.com, 1-877-PYTHIAN
     To follow us:
     http://www.pythian.com/news/
     http://www.facebook.com/pages/The-Pythian-Group/163902527671
     @pythian @pythianjobs
     http://www.linkedin.com/company/pythian
