Integrated Data Warehouse with Hadoop and Oracle Database

Use cases and advice on integrating Hadoop with the enterprise data warehouse

Published in: Technology

  1. Building the Integrated Data Warehouse with Oracle Database and Hadoop. Gwen Shapira, Senior Consultant
  2. Why Pythian • Recognized Leader: global industry leader in data infrastructure managed services and consulting, with expertise in Oracle, Oracle Applications, Microsoft SQL Server, MySQL, big data and systems administration; works with over 200 multinational companies such as Forbes.com, Fox Sports, Nordion and Western Union to help manage their complex IT deployments • Expertise: one of the world’s largest concentrations of dedicated, full-time DBA expertise; employs 8 Oracle ACEs/ACE Directors; holds 7 Specializations under the Oracle Platinum Partner program, including Oracle Exadata, Oracle GoldenGate & Oracle RAC • Global Reach & Scalability: 24/7/365 global remote support for DBA and consulting, systems administration, special projects or emergency response © 2012 – Pythian
  3. About Gwen Shapira • Oracle ACE Director • 13 years with pager • 7 as Oracle DBA • Senior Consultant: has MacBook, will travel. • @gwenshap • http://www.pythian.com/news/author/shapira/
  4. Agenda • What is Big Data? • Why do we care about Big Data? • Why does your DWH need Hadoop? • Examples of Hadoop in the DWH • How to integrate Hadoop into your DWH • Avoiding major pitfalls
  5. What is Big Data?
  6. MORE DATA THAN YOU CAN HANDLE
  7. MORE DATA THAN RELATIONAL DATABASES CAN HANDLE
  8. MORE DATA THAN RELATIONAL DATABASES CAN HANDLE CHEAPLY
  9. Data arriving at fast rates • Typically unstructured • Stored without aggregation • Analyzed in real time • For reasonable cost
  10. Where does Big Data come from? • Social media • Enterprise transactional data • Consumer behaviour • Multimedia • Sensors and embedded devices • Network devices
  11. Why all the Excitement?
  12. Complex Data Architecture
  13. Your DWH needs Hadoop
  14. Big Problems with Big Data • It is: unstructured, unprocessed, un-aggregated, un-filtered, repetitive, and generally messy. Oh, and there is a lot of it.
  15. Technical Challenges • Storage capacity, storage throughput → Scalable storage • Pipeline throughput, processing power, parallel processing → Massive Parallel Processing • System integration, data analysis → Ready to use tools
  16. Hadoop Principles • Bring Code to Data • Share Nothing
  17. Hadoop in a Nutshell • HDFS: Replicated Distributed Big-Data File System • Map-Reduce: Framework for writing massively parallel jobs
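
The Map-Reduce half of slide 17 can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model only (map, then group by key, then reduce); it does not use Hadoop's API, and the function names are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key.

    In a real cluster the framework does this between the two phases,
    moving data across the network; here it is an in-memory dict.
    """
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values seen for one key into a single result."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the three steps in sequence over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each call to `map_phase` looks at one line and each call to `reduce_phase` looks at one key, both steps can in principle run on many machines at once, which is the "massively parallel" part of the slide.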
  18. Hadoop Benefits • Reliable solution based on unreliable hardware • Designed for large files • Load data first, structure later • Designed to maximize throughput of large scans • Designed to maximize parallelism • Designed to scale • Flexible development platform • Solution ecosystem
  19. Hadoop Limitations • Hadoop is scalable but not fast • Batteries not included • Instrumentation not included either • Well-known reliability limitations
  20. Hadoop in the Data Warehouse: Use Cases and Customer Stories
  21. ETL for Unstructured Data • Logs (web servers, app server, clickstreams) → Flume → Hadoop (cleanup, aggregation, long-term storage) → DWH (BI, batch reports)
  22. ETL for Structured Data • OLTP (Oracle, MySQL, Informix…) → Sqoop, Perl → Hadoop (transformation, aggregation, long-term storage) → DWH (BI, batch reports)
  23. Bring the World into Your Datacenter
  24. Rare Historical Report
  25. Find Needle in Haystack
  26. We are not doing SQL anymore
  27. Connecting the (big) Dots
  28. Sqoop Queries
  29. Sqoop is Flexible (for import) • Select columns from table where condition • Or write your own query • Split column • Parallel • Incremental • File formats
  30. Sqoop Import Examples • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date  '01-01-2012'" • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16 • Note: the --split-by column must be indexed or partitioned to avoid 16 full table scans
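
The warning about 16 full table scans on slide 30 follows from how a parallel import is divided: each mapper runs its own query restricted to a slice of the split column. The sketch below is a simplified model of that idea for an integer column, not Sqoop's actual splitter; the helper names `split_ranges` and `where_clauses` are hypothetical, and the real boundaries Sqoop generates may differ.

```python
from typing import List, Tuple


def split_ranges(lo: int, hi: int, num_mappers: int) -> List[Tuple[int, int]]:
    """Divide the inclusive range [lo, hi] into contiguous, near-equal slices.

    lo and hi stand for the MIN and MAX of the split column.
    """
    if hi < lo or num_mappers < 1:
        raise ValueError("need lo <= hi and at least one mapper")
    total = hi - lo + 1
    n = min(num_mappers, total)          # never create empty slices
    base, extra = divmod(total, n)
    ranges = []
    start = lo
    for i in range(n):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges


def where_clauses(column: str, lo: int, hi: int, num_mappers: int) -> List[str]:
    """Build one range predicate per mapper (illustrative SQL fragments)."""
    return [
        f"{column} >= {a} AND {column} <= {b}"
        for a, b in split_ranges(lo, hi, num_mappers)
    ]


if __name__ == "__main__":
    # 16 mappers over shop_id values 1..1600: one predicate per mapper.
    for clause in where_clauses("shop_id", 1, 1600, 16):
        print(clause)
```

Each of those predicates becomes a separate query against the source table. Without an index or partitioning on the split column, the database may have to scan the whole table once per mapper to satisfy them.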
  31. Less Flexible Export • 100-row batch inserts • Commit every 100 batches • Parallel export • Update mode • Example: sqoop export --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --table bar --export-dir /results/bar_data
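
The batch and commit sizes on slide 31 translate into simple arithmetic: 100 rows per batch and 100 batches per commit means a commit roughly every 10,000 rows. The sketch below works that out for a given row count; the constants come from the slide, while the function name `export_cost` is only illustrative.

```python
import math
from typing import Tuple

ROWS_PER_BATCH = 100       # from the slide: 100-row batch inserts
BATCHES_PER_COMMIT = 100   # from the slide: commit every 100 batches


def export_cost(rows: int) -> Tuple[int, int]:
    """Return (batch inserts, commits) needed to export `rows` rows
    through a single export task, under the sizes stated above."""
    if rows < 0:
        raise ValueError("rows must be non-negative")
    batches = math.ceil(rows / ROWS_PER_BATCH)
    commits = math.ceil(batches / BATCHES_PER_COMMIT)
    return batches, commits


if __name__ == "__main__":
    batches, commits = export_cost(1_000_000)
    print(f"{batches} batch inserts, {commits} commits")
```

One consequence worth keeping in mind: because commits happen while the export is still running, a job that fails midway can leave part of the data already committed in the target table.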
  32. Fuse-DFS • Mount HDFS on the Oracle server: sudo yum install hadoop-0.20-fuse, then hadoop-fuse-dfs dfs://name_node_hostname:namenode_port mount_point • Use external tables to load data into Oracle • File formats may vary • All ETL best practices apply
  33. Oracle Loader for Hadoop • Load data from Hadoop into Oracle • Map-Reduce job inside Hadoop • Converts data types • Partitions and sorts • Direct path loads • Reduces CPU utilization on the database
  34. Oracle Direct Connector to HDFS • Create external tables of files in HDFS • PREPROCESSOR HDFS_BIN_PATH:hdfs_stream • All the features of External Tables • Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS
  35. Big Data Appliance and Exadata
  36. How not to Fail
  37. Data that belongs in RDBMS
  38. Prepare for Migration
  39. Use Hadoop Efficiently • Understand your bottlenecks: CPU, storage or network? • Reduce use of temporary data: all data is over the network, written to disk in triplicate • Eliminate unbalanced workloads • Offload work to RDBMS • Fine-tune optimization with Map-Reduce
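
Two of the points on slide 39 can be put into numbers. The sketch below assumes the common HDFS replication factor of 3 for the "written to disk in triplicate" remark, and it models reducer assignment with a CRC32 hash as a stand-in for Hadoop's hash-based partitioning. All function names are hypothetical and the model ignores compression, combiners and speculative execution.

```python
import zlib
from collections import Counter
from typing import Iterable, List


def physical_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Disk space consumed when temporary data is written to HDFS.

    Assumes every block is stored `replication` times (3 by default).
    """
    return logical_bytes * replication


def reducer_loads(keys: Iterable[str], num_reducers: int) -> List[int]:
    """Count how many records land on each reducer.

    All records with the same key go to the same reducer, so one very
    frequent key overloads a single reducer. CRC32 is used here only
    because it is deterministic; it is not Hadoop's hash function.
    """
    loads = [0] * num_reducers
    for key, count in Counter(keys).items():
        loads[zlib.crc32(key.encode("utf-8")) % num_reducers] += count
    return loads


def skew(loads: List[int]) -> float:
    """Ratio of the busiest reducer to the average reducer.

    1.0 means perfectly balanced; the job finishes only when the
    busiest reducer does, so higher values mean wasted parallelism.
    """
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0
```

For example, `skew([70, 10, 10, 10])` is 2.8: three reducers sit idle for most of the job while the fourth works through the hot key, which is the unbalanced workload the slide says to eliminate.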
  40. Your Data is NOT as BIG as you think
  41. Getting Started • Pick a business problem • Acquire data • Use the right tool for the job • Hadoop can start on the cheap • Integrate the systems • Analyze data • Get operational
  42. Thank you and Q&A • To contact us: sales@pythian.com • 1-877-PYTHIAN • To follow us: http://www.pythian.com/news/ • http://www.facebook.com/pages/The-Pythian-Group/163902527671 • @pythian • @pythianjobs • http://www.linkedin.com/company/pythian