Integrated Data Warehouse with Hadoop and Oracle Database
Use cases and advice on integrating Hadoop with the enterprise data warehouse

Integrated Data Warehouse with Hadoop and Oracle Database Presentation Transcript

  • 1. Building the Integrated Data Warehouse with Oracle Database and Hadoop. Gwen Shapira, Senior Consultant
  • 2. Why Pythian
    • Recognized leader: global industry leader in data infrastructure managed services and consulting, with expertise in Oracle, Oracle Applications, Microsoft SQL Server, MySQL, big data, and systems administration. Works with over 200 multinational companies, such as Forbes.com, Fox Sports, Nordion, and Western Union, to help manage their complex IT deployments.
    • Expertise: one of the world's largest concentrations of dedicated, full-time DBA expertise; employs 8 Oracle ACEs/ACE Directors and holds 7 specializations under the Oracle Platinum Partner program, including Oracle Exadata, Oracle GoldenGate, and Oracle RAC.
    • Global reach and scalability: 24/7/365 global remote support for DBA and consulting, systems administration, special projects, or emergency response.
  • 3. About Gwen Shapira
    • Oracle ACE Director
    • 13 years with a pager, 7 as an Oracle DBA
    • Senior Consultant: has MacBook, will travel
    • @gwenshap
    • http://www.pythian.com/news/author/shapira/
  • 4. Agenda
    • What is Big Data?
    • Why do we care about Big Data?
    • Why your DWH needs Hadoop
    • Examples of Hadoop in the DWH
    • How to integrate Hadoop into your DWH
    • Avoiding major pitfalls
  • 5. What is Big Data?
  • 6. MORE DATA THAN YOU CAN HANDLE
  • 7. MORE DATA THAN RELATIONAL DATABASES CAN HANDLE
  • 8. MORE DATA THAN RELATIONAL DATABASES CAN HANDLE CHEAPLY
  • 9. Data arriving at fast rates. Typically unstructured. Stored without aggregation. Analyzed in real time. For a reasonable cost.
  • 10. Where does Big Data come from?
    • Social media
    • Enterprise transactional data
    • Consumer behaviour
    • Multimedia
    • Sensors and embedded devices
    • Network devices
  • 11. Why all the Excitement?
  • 12. Complex Data Architecture
  • 13. Your DWH needs Hadoop
  • 14. Big Problems with Big Data
    • It is: unstructured, unprocessed, un-aggregated, un-filtered, repetitive, and generally messy
    • Oh, and there is a lot of it
  • 15. Technical Challenges
    • Scalable storage: storage capacity, storage throughput, pipeline throughput
    • Massively parallel processing: processing power, parallel processing
    • Ready-to-use tools: system integration, data analysis
  • 16. Hadoop Principles: Bring Code to Data. Share Nothing.
  • 17. Hadoop in a Nutshell
    • HDFS: replicated, distributed big-data file system
    • Map-Reduce: framework for writing massively parallel jobs
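
    As an illustration of both pieces, here is a minimal word-count sketch using Hadoop Streaming; the jar path and HDFS paths are assumptions, not from the deck:

      # Word count with Hadoop Streaming: the mapper and reducer are plain
      # shell commands shipped to the nodes holding the data blocks
      # ("bring code to data"). Jar location and paths are illustrative.
      hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
          -input /logs/raw \
          -output /logs/wordcount \
          -mapper 'tr -s " " "\n"' \
          -reducer 'uniq -c'

    Because the framework sorts mapper output by key before the reduce phase, identical words arrive consecutively, so uniq -c is enough to produce per-word counts.
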
  • 18. Hadoop Benefits
    • Reliable solution based on unreliable hardware
    • Designed for large files
    • Load data first, structure later
    • Designed to maximize throughput of large scans
    • Designed to maximize parallelism
    • Designed to scale
    • Flexible development platform
    • Solution ecosystem
  • 19. Hadoop Limitations
    • Hadoop is scalable but not fast
    • Batteries not included
    • Instrumentation not included either
    • Well-known reliability limitations
  • 20. Hadoop in the Data Warehouse: Use Cases and Customer Stories
  • 21. ETL for Unstructured Data: logs from web servers, app servers, and clickstreams flow through Flume into Hadoop for cleanup and aggregation (and long-term storage), then on to the DWH for BI and batch reports.
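
    A minimal sketch of the collection leg of this flow, assuming Flume NG; the agent name, log path, and HDFS URL are illustrative, not from the deck:

      # Tail a web-server log into HDFS with a one-source, one-sink agent.
      # All names and paths are hypothetical.
      cat > weblog.conf <<'EOF'
      agent.sources = weblog
      agent.channels = mem
      agent.sinks = to-hdfs
      agent.sources.weblog.type = exec
      agent.sources.weblog.command = tail -F /var/log/httpd/access_log
      agent.sources.weblog.channels = mem
      agent.channels.mem.type = memory
      agent.sinks.to-hdfs.type = hdfs
      agent.sinks.to-hdfs.hdfs.path = hdfs://namenode:8020/logs/raw
      agent.sinks.to-hdfs.channel = mem
      EOF
      flume-ng agent --name agent --conf-file weblog.conf
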
  • 22. ETL for Structured Data: OLTP systems (Oracle, MySQL, Informix…) feed Hadoop via Sqoop or Perl for transformation and aggregation (and long-term storage), then on to the DWH for BI and batch reports.
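
    The transformation/aggregation step in the middle of this flow can be as simple as a Hive query over the Sqoop-imported tables; a sketch with hypothetical table and column names:

      # Aggregate raw OLTP rows (imported into Hive by Sqoop) into a daily
      # summary destined for the DWH. Table and column names are hypothetical.
      hive -e "
        INSERT OVERWRITE TABLE daily_sales
        SELECT shop_id, to_date(sale_ts) AS sale_day, SUM(amount)
        FROM raw_sales
        GROUP BY shop_id, to_date(sale_ts);
      "
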
  • 23. Bring the World into Your Datacenter
  • 24. Rare Historical Report
  • 25. Find Needle in Haystack
  • 26. We are not doing SQL anymore
  • 27. Connecting the (big) Dots
  • 28. Sqoop Queries
  • 29. Sqoop is Flexible (for import)
    • Select <columns> from <table> where <condition>
    • Or <write your own query>
    • Split column
    • Parallel
    • Incremental
    • File formats
  • 30. Sqoop Import Examples

    sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
        --username hr --table emp \
        --where "start_date > '01-01-2012'"

    sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
        --username myuser \
        --table shops --split-by shop_id \
        --num-mappers 16

    The split column must be indexed or partitioned to avoid 16 full table scans.
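
    One mode from the previous slide, incremental import, is not shown above; a minimal sketch with a hypothetical table and check column:

      # Import only rows whose order_id exceeds the last value already
      # loaded; table and column names are hypothetical.
      sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
          --username myuser --table orders \
          --incremental append --check-column order_id --last-value 1000000
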
  • 31. Less Flexible Export
    • 100-row batch inserts
    • Commit every 100 batches
    • Parallel export
    • Update mode

    Example:
    sqoop export --connect jdbc:oracle:thin:@//dbserver:1521/masterdb \
        --table bar \
        --export-dir /results/bar_data
  • 32. Fuse-DFS
    • Mount HDFS on the Oracle server:
      sudo yum install hadoop-0.20-fuse
      hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point>
    • Use external tables to load data into Oracle
    • File formats may vary
    • All ETL best practices apply
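
    To make the external-table step concrete, a sketch assuming HDFS is mounted at /mnt/hdfs; the credentials, directory object, table names, and file format are all hypothetical:

      # Expose a CSV file on the FUSE mount as an external table, then load
      # it into the target table. All names are hypothetical.
      sqlplus hr/hr <<'EOF'
      CREATE OR REPLACE DIRECTORY hdfs_dir AS '/mnt/hdfs/results';
      CREATE TABLE bar_ext (shop_id NUMBER, amount NUMBER)
      ORGANIZATION EXTERNAL (
        TYPE ORACLE_LOADER
        DEFAULT DIRECTORY hdfs_dir
        ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE
                           FIELDS TERMINATED BY ',')
        LOCATION ('bar_data.csv')
      );
      INSERT /*+ APPEND */ INTO bar SELECT * FROM bar_ext;
      EOF
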
  • 33. Oracle Loader for Hadoop
    • Loads data from Hadoop into Oracle
    • Runs as a Map-Reduce job inside Hadoop
    • Converts data types, partitions, and sorts
    • Direct path loads
    • Reduces CPU utilization on the database
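
    Because the loader runs as a Map-Reduce job, it is submitted like one; a sketch of the invocation, where the jar location and the job configuration file (JDBC connection, target table, field-to-column mapping) are assumptions:

      # Run Oracle Loader for Hadoop; my_load_job.xml (hypothetical) holds
      # the connection details and input-to-column mappings.
      hadoop jar $OLH_HOME/jlib/oraloader.jar oracle.hadoop.loader.OraLoader \
          -conf my_load_job.xml
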
  • 34. Oracle Direct Connector to HDFS
    • Creates external tables over files in HDFS
    • PREPROCESSOR HDFS_BIN_PATH:hdfs_stream
    • All the features of external tables
    • Tested (by Oracle) as 5 times faster (GB/s) than Fuse-DFS
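
    A sketch of where the PREPROCESSOR clause from this slide fits in an external table definition; the credentials, directory objects, columns, and location file are hypothetical (the connector generates the location files that point at HDFS):

      # External table that streams data straight out of HDFS via the
      # hdfs_stream preprocessor. All names are hypothetical.
      sqlplus hr/hr <<'EOF'
      CREATE TABLE sales_hdfs_ext (shop_id NUMBER, amount NUMBER)
      ORGANIZATION EXTERNAL (
        TYPE ORACLE_LOADER
        DEFAULT DIRECTORY ext_dir
        ACCESS PARAMETERS (RECORDS DELIMITED BY NEWLINE
                           PREPROCESSOR hdfs_bin_path:hdfs_stream
                           FIELDS TERMINATED BY ',')
        LOCATION ('sales1.loc')
      );
      EOF
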
  • 35. Big Data Appliance and Exadata
  • 36. How not to Fail
  • 37. Data that belongs in RDBMS
  • 38. Prepare for Migration
  • 39. Use Hadoop Efficiently
    • Understand your bottlenecks: CPU, storage, or network?
    • Reduce use of temporary data: all of it travels over the network and is written to disk in triplicate
    • Eliminate unbalanced workloads
    • Offload work to the RDBMS
    • Fine-tune optimization with Map-Reduce
  • 40. Your Data is NOT as BIG as you think
  • 41. Getting Started
    • Pick a business problem
    • Acquire data
    • Use the right tool for the job
    • Hadoop can start on the cheap
    • Integrate the systems
    • Analyze data
    • Get operational
  • 42. Thank You and Q&A
    To contact us: sales@pythian.com, 1-877-PYTHIAN
    To follow us: http://www.pythian.com/news/, http://www.facebook.com/pages/The-Pythian-Group/163902527671, @pythian, @pythianjobs, http://www.linkedin.com/company/pythian