Building the Integrated Data Warehouse with Oracle Database and Hadoop


  • We are a managed service AND a solution provider of elite database and System Administration skills in Oracle, MySQL and SQL Server
  • Trying to separate data into "small" and "big" is not a useful segmentation. Let's look at structure, processing and data sources instead.
  • I am going to show a lot of examples of how Hadoop is used to store and process data. I don't want anyone to tell me "But I can do it in Oracle". I know you can, and so can I. But there is no point in using Oracle where it is less efficient than other solutions or where it doesn't have any specific advantage.
  • Big data is not called big data because it fits well on a thumb drive. It requires a lot of storage, partly because it's a lot of data, and partly because it is unstructured, unprocessed, un-aggregated, repetitive and generally messy.
  • The ideas are simple: 1. Data is big, code is small. It is far more efficient to move the small code to the big data than vice versa. If you have a pillow and a sofa in a room, you typically move the pillow to the sofa, not the sofa to the pillow. But many developers are too comfortable with the select-then-process anti-pattern. This principle is in place to help with the throughput challenges. 2. Sharing is nice, but safe sharing of data typically means locking, queueing, bottlenecks and race conditions. It is notoriously difficult to get concurrency right, and even if you do, it is slower than the alternative. Hadoop works around the whole thing. This principle is in place to deal with the parallel-processing challenges.
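The "bring code to data" principle can be sketched with the classic word-count shape of a map-reduce job. The following is an illustrative, self-contained Python sketch that runs a mapper and reducer locally against an in-memory list of lines; all names are invented for illustration, and a real Hadoop Streaming job would ship these small functions to the data nodes instead:

```python
# A minimal local sketch of "bring code to data": the same mapper and
# reducer that a Hadoop Streaming job would ship to each data node,
# run here against an in-memory list of lines.
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Emit (word, 1) for every word -- the small piece of code that
    # travels to wherever the data block lives.
    for word in line.split():
        yield word.lower(), 1

def reduce_counts(pairs):
    # Shuffle (sort by key), then sum each group, as the framework would.
    pairs = sorted(pairs, key=itemgetter(0))
    return {word: sum(count for _, count in group)
            for word, group in groupby(pairs, key=itemgetter(0))}

lines = ["big data is big", "code is small"]
counts = reduce_counts(pair for line in lines for pair in mapper(line))
```

On a real cluster only the few lines of mapper/reducer code travel over the network; the terabytes of data stay where they are.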
  • The default block size is 64 MB. You can place any data file in HDFS; later processing can find the meaning in the data.
  • Many projects fail because people imagine a very rosy image of Hadoop. They think they can just throw all the data there and it will magically and quickly become value. Such misguided expectations also happen with other platforms and doom other projects too. To be successful with Hadoop, we need to be realistic about it.
  • Note that while this presentation shows use cases where Hadoop is used to enhance the enterprise data warehouse, there are many examples where Hadoop is the backend of an entire product. Search engines and recommendation engines are such examples. While those are important use cases, they are out of scope for this presentation.
  • Much more controversial, especially when going from Oracle to Oracle. The data is clearly structured, so why can't we use an RDBMS (on either the OLTP or the DW side) to process it? Hadoop makes sense: when structured data needs integration with unstructured data before loading; when the ETL part of the process doesn't scale (if 24 hours of data take more than 24 hours to process, the choice is either a bigger database or more Hadoop nodes, and if the rest of the database workload scales, Hadoop is an attractive option); and when Hadoop replaces a homegrown Perl file-processing system, since it is more centralized, easier to manage and scales better.
  • There is a lot of interesting data that is not generated by your company: listings of businesses in specific locations, connections in social media. The data may be unstructured, semi-structured or even structured, but it isn't structured in the way your DWH expects and needs. We need a landing pad for cleanup, pre-processing, aggregating, filtering and structuring, and Hadoop is perfect for this: mappers can scrape data from websites efficiently; map-reduce jobs clean up and process the data; and then the results are loaded into your DWH.
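As a concrete, entirely hypothetical illustration of that landing-pad role, here is a Python sketch of the kind of mapper-style cleanup that might run on scraped business listings before loading into the DWH. The field names and cleanup rules are invented, not taken from the talk:

```python
# Hypothetical "landing pad" cleanup step: normalize messy, scraped
# business listings and drop rows the DWH cannot use. Field names and
# rules are illustrative only.
import json

def clean_listing(raw_line):
    """Turn one scraped JSON record into a normalized row, or None."""
    try:
        record = json.loads(raw_line)
    except ValueError:
        return None  # messy input is expected; drop unparseable rows
    name = record.get("name", "").strip().title()
    city = record.get("city", "").strip().lower()
    if not name or not city:
        return None  # filter incomplete rows before they reach the DWH
    return {"name": name, "city": city}

rows = [clean_listing(line) for line in [
    '{"name": "  acme corp ", "city": " Boston "}',
    "not json at all",
    '{"name": "", "city": "nyc"}',
]]
clean = [row for row in rows if row]
```

In a real pipeline this function would be the body of a mapper, with the cleaned rows written back to HDFS for loading.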
  • We want the top 3 items bought by left-handed women between the ages of 21 and 23, on November 15, 1998. How long will it take you to answer this question? For one of my customers, the answer is 25 minutes. As data grows older, it usually becomes less valuable to the business, and it gets aggregated and shelved off to tapes or other cheap storage. This means that for many organizations, answering detailed questions about events that happened more than a few months ago is impossible, or at least very challenging. The business learned never to ask those questions, because the answer is "you can't". Hadoop combines cheap storage and massive processing power, which allows us to store a detailed history of our business and to generate reports about it. And once the answer to questions about history is "You will have your data in 25 minutes" instead of "impossible", the questions turn out to be less rare than we assumed.
  • 7 petabytes of log file data; 3 lines point to the security hole that allowed a break-in last week. Your DWH has aggregated information from the logs. Maybe. Hadoop is very cost-effective for storing data: lots of cheap disks, and it's easy to throw data in without pre-processing. Search the data when you need it.
  • Pythian is a remote DBA company. Many customers feel a bit anxious when they let people they haven't even met into their most critical databases. One of the ways Pythian deals with this is by continuously recording the screen of the VM that DBAs use to connect to customer environments. Our customers have access to those videos and can replay them to check what the DBAs were doing. Our system also allows text search in the video. Perhaps you want to know if we ever issued "drop table" on the system before a critical table disappeared? Or perhaps you want to see how we handled ORA-4032 so you can learn how to do it yourself in the future? The pipeline runs OCR on screen video captured from Pythian's privileged-access surveillance system: Flume streams raw frames from the video capture; a map-reduce job runs OCR on the frames and produces text; another map-reduce job identifies text changes from frame to frame and produces a text stream with the timestamp when each piece of text was on screen; and other map-reduce jobs mine the text and keystrokes for insights, such as credit card patterns, sensitive commands (like DROP TABLE), root access and unusual activity patterns.
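The frame-to-frame diff step lends itself to a simple sketch. This is not Pythian's actual code, just an illustrative Python version of the idea: collapse a time-ordered stream of (timestamp, OCR text) frames into timestamped events that fire only when the on-screen text changes:

```python
# Illustrative sketch (not Pythian's code) of the frame-to-frame diff
# step: collapse a stream of (timestamp, ocr_text) frames into events
# recording when the on-screen text changed.
def text_change_events(frames):
    """frames: iterable of (timestamp, text) pairs in time order.
    Yields (timestamp, text) only when the text differs from the
    previous frame, i.e. the moment new text appeared on screen."""
    previous = None
    for timestamp, text in frames:
        if text != previous:
            yield timestamp, text
            previous = text

events = list(text_change_events([
    (0, "SQL> "),
    (1, "SQL> "),                 # unchanged frame, suppressed
    (2, "SQL> DROP TABLE emp;"),  # change: worth a timestamped event
]))
```

Downstream jobs can then scan these compact events for patterns like sensitive commands without reprocessing every raw frame.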
  • Wake up everyone, this is the meat and potatoes of the presentation. How do we integrate the Hadoop potatoes with the DWH meat?
  • It is often said that the best way to succeed is to avoid failure for long enough. Here is some advice that will help your Hadoop projects avoid failure.
  • If the data is structured, especially if it arrives from a relational database, it is highly likely that a relational database will process it more efficiently than Hadoop. After all, RDBMS were built for this, with many features to support data-processing tasks. OLTP workloads don't work with Hadoop at all; just don't try. Anything real-time will not work well with Hadoop. Most BI tools don't integrate with Hadoop at the moment.
  • Taking an ETL process that used to be in an RDBMS and dropping it on Hadoop by exporting whole tables with Sqoop and using Hive to process the data is unlikely to be any faster. Getting value out of Hadoop involves evaluating the work, understanding bottlenecks and finding the most efficient solution, whether with Hadoop, a relational database or both.
  • Bad schema design is not big data. Using 8-year-old hardware is not big data. Not having a purging policy is not big data. Not configuring your database and operating system correctly is not big data. Poor data filtering is not big data either. Keep the data you need and use, in a way that you can actually use it. If doing this requires cutting-edge technology, excellent! But don't tell me you need NoSQL because you don't purge data and have un-optimized PL/SQL running on 10-year-old hardware.

    1. Building the Integrated Data Warehouse with Oracle Database and Hadoop. Gwen Shapira, Senior Consultant
    2. 13 years with a pager. Oracle ACE Director. Oak Table member. Senior consultant for Pythian. @gwenshap © 2012 – Pythian
    3. Pythian. Recognized leader: • Global industry leader in remote database administration services and consulting for Oracle, Oracle Applications, MySQL and Microsoft SQL Server • Work with over 165 multinational companies such as LinkShare Corporation, IGN Entertainment, CrowdTwist, TinyCo and Western Union to help manage their complex IT deployments. Expertise: • One of the world's largest concentrations of dedicated, full-time DBA expertise. Employ 7 Oracle ACEs/ACE Directors. Heavily involved in the MySQL community, driving the MySQL Professionals Group, and sit on the IOUG Advisory Board for MySQL. • Hold 7 specializations under the Oracle Platinum Partner program, including Oracle Exadata, Oracle GoldenGate & Oracle RAC
    4. Agenda: • What is Big Data? • Why do we care about Big Data? • Why does your DWH need Hadoop? • Examples of Hadoop in the DWH • How to integrate Hadoop into your DWH • Avoiding major pitfalls
    5. What is Big Data?
    6. Doesn't matter. We are here to discuss architectures, not define market segments.
    7. What Does Matter? Some data types are a bad fit for RDBMS. Some problems are a bad fit for RDBMS. We can call them BIG if you want. Data warehouses have always been BIG.
    8. Given enough skill and money, Oracle can do anything. Let's talk about efficient solutions.
    9. When Does RDBMS Make No Sense? • Storing images and video • Processing images and video • Storing and processing other large files (PDFs, Excel files) • Processing large blocks of natural-language text (blog posts, job ads, product descriptions) • Processing semi-structured data (CSV, JSON, XML, log files; sensor data)
    10. When Does RDBMS Make No Sense? • Ad-hoc, exploratory analytics • Integrating data from external sources • Data cleanup tasks • Very advanced analytics (machine learning)
    11. New Data Sources: • Blog posts • Social media • Images • Videos • Logs from web applications • Sensors. They all have large potential value, but they are an awkward fit for traditional data warehouses.
    12. Your DWH Needs Hadoop
    13. Big Problems with Big Data. It is: • Unstructured • Unprocessed • Un-aggregated • Un-filtered • Repetitive • Low quality • And generally messy. Oh, and there is a lot of it.
    14. Technical Challenges: • Storage capacity and storage throughput → scalable storage • Pipeline throughput, processing power and parallel processing → massive parallel processing • System integration and data analysis → ready-to-use tools
    15. Hadoop Principles: Bring Code to Data. Share Nothing.
    16. Hadoop in a Nutshell: a replicated, distributed big-data file system, plus Map-Reduce, a framework for writing massively parallel jobs.
    17. Hadoop Benefits: • Reliable solution based on unreliable hardware • Designed for large files • Load data first, structure later • Designed to maximize throughput of large scans • Designed to maximize parallelism • Designed to scale • Flexible development platform • Solution ecosystem
    18. Hadoop Limitations: • Hadoop is scalable but not fast • Batteries not included • Instrumentation not included either • Well-known reliability limitations
    19. Hadoop in the Data Warehouse: Use Cases and Customer Stories
    20. ETL for Unstructured Data: logs from web servers, app servers and clickstreams stream through Flume into Hadoop (cleanup, aggregation, long-term storage) and on to the DWH for BI and batch reports.
    21. ETL for Structured Data: data from OLTP systems (Oracle, MySQL, Informix…) moves via Sqoop or Perl into Hadoop (transformation, aggregation, long-term storage) and on to the DWH for BI and batch reports.
    22. Bring the World into Your Datacenter
    23. Rare Historical Report
    24. Find Needle in Haystack
    25. We are not doing SQL anymore
    26. Connecting the (big) Dots
    27. Sqoop Queries
    28. Sqoop is Flexible: Import • select <columns> from <table> where <condition> • Or <write your own query> • Split column • Parallel • Incremental • File formats
    29. Sqoop Import Examples: • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username hr --table emp --where "start_date > '01-01-2012'" • sqoop import --connect jdbc:oracle:thin:@//dbserver:1521/masterdb --username myuser --table shops --split-by shop_id --num-mappers 16 (the split column must be indexed or partitioned to avoid 16 full table scans)
    30. Less Flexible Export: • 100-row batch inserts • Commit every 100 batches • Parallel export • Merge vs. insert. Example: sqoop export --connect jdbc:mysql://bar --export-dir /results/bar_data
    31. FUSE-DFS: • Mount HDFS on the Oracle server: sudo yum install hadoop-0.20-fuse; hadoop-fuse-dfs dfs://<name_node_hostname>:<namenode_port> <mount_point> • Use external tables to load data into Oracle • File formats may vary • All ETL best practices apply
    32. Oracle Loader for Hadoop: • Load data from Hadoop into Oracle • Map-Reduce job inside Hadoop • Converts data types, partitions and sorts • Direct-path loads • Reduces CPU utilization on the database • NEW: support for Avro and for compression codecs
    33. Oracle Direct Connector to HDFS: • Create external tables of files in HDFS • PREPROCESSOR HDFS_BIN_PATH:hdfs_stream • All the features of external tables • Tested (by Oracle) as 5 times faster (GB/s) than FUSE-DFS
    34. Oracle SQL Connector for HDFS: • Map-Reduce Java program • Creates an external table • Can use the Hive metastore for schema • Optimized for parallel queries • Supports Avro and compression
    35. How Not to Fail
    36. Data That Belongs in RDBMS
    37. Prepare for Migration
    38. Use Hadoop Efficiently: • Understand your bottlenecks: CPU, storage or network? • Reduce use of temporary data: all data goes over the network and is written to disk in triplicate • Eliminate unbalanced workloads • Offload work to the RDBMS • Fine-tune optimization with Map-Reduce
    39. Your Data is NOT as BIG as You Think
    40. Getting Started: • Pick a business problem • Acquire data • Use the right tool for the job • Hadoop can start on the cheap • Integrate the systems • Analyze data • Get operational
    41. Thank you & Q&A. To contact us… 1-866-PYTHIAN. To follow us…