Best Practices for Development Apps for Big Data

Best Practices for Development Apps for Big Data. Exadata, Exalytics, Big Data Appliance. Hadoop, HDFS, Using R with Oracle Database and Hadoop. Fast Data for Gathering Information.


Speaker Notes

  • We'll start by introducing you to the platform. We'll talk about use cases, and then zero in on the use case that you will be working with as part of your hands-on labs (HOLs). Frankly, across these use cases you'll find similarity in terms of data processing flows. We'll review the Oracle MoviePlex design pattern/architecture.
  • In the rest of the presentation we’ll walk through the lifecycle of big data. So Big Data is all about making better business decisions to grow revenue and lower costs. The lifecycle of big data is acquire, organize, analyze, decide.
  • The platform consists of: the Big Data Appliance to source unstructured/semi-structured data; Exadata to combine that data, once structured, with traditional schema-based data and run in-database analytics on it; and Exalytics for in-memory extreme analytics. All connected by InfiniBand.
  • Added standalone software components. So to summarize: we believe this is the industry's most complete and integrated solution for acquiring, organizing, and analyzing big data. If someone comes up to you and needs you to deploy big data in a few weeks, we can help you do this — the fastest time to value. We have the software: Oracle NoSQL Database, Enterprise Manager Cloud Control, Hadoop, Oracle Data Integrator for Hadoop, Oracle Loader for Hadoop, R, and Oracle BI EE. Plus we have the Big Data Appliance, Exadata, and Exalytics to provide engineered systems for running the software. In closing, I hope this session has been informative and that you can all go back to your organizations and explain what big data is (high volume, low value density), and how it can be acquired, organized, loaded into your existing data warehouse, and analyzed to bring new value to your business.
  • So you have BIG data. You're running MapReduce on that data, and you want to load or access some of that data in Oracle Database for further analytics. This is what the Oracle Big Data Connectors are for. Note that the data is transformed into a structured form before it is loaded or accessed by the connectors.
  • You have seen a similar slide in other big data presentations from Oracle, outlining the different stages in a big data application. The potential treasure trove of less structured data such as weblogs, social media, email, sensors, and location data can provide a wealth of useful information for business applications. Hadoop provides a massively parallel architecture to distill desired information from huge volumes of unstructured and semi-structured content. Frequently, this data needs to be analyzed together with existing data in relational databases, the platform for most commercial applications. The two sets of data need to be combined to enable users to derive greater insights from the less structured data that is processed and stored on Hadoop clusters, using the data in relational databases. A set of technologies and utilities referred to as "connectors" are necessary to make the data on Hadoop available to the database for analysis with the data in the database. Oracle Loader for Hadoop and Oracle SQL Connector for HDFS are two high performance connectors to load and access very large volumes of data on Hadoop. We see here the different stages in a big data solution. Oracle has engineered solutions for each of these stages: Oracle Big Data Appliance, Oracle Exadata (an engineered system for running Oracle Database), and Oracle Exalytics (an engineered system for BI applications), all connected by InfiniBand, the super highway that integrates Oracle's engineered systems.
Note that the Big Data Connectors work both with the engineered systems and with generic Hadoop and database installations (specific versions are discussed later in the presentation).
To set up today's conversation: you may already know a lot about Exadata and Exalytics (Oracle BI). We have been hard at work developing a key component of the Big Data Platform, the Big Data Appliance (BDA), and are excited to speak to you about it today. We are leveraging Oracle's appliance expertise, along with the advice and technology of industry experts, to create an open platform. Although it is new, it offers a solid foundation, using technology that is well tested by the biggest players in the market. We then took this open system and optimized it for Oracle, delivering unique capabilities that simplify connections to the rest of your Oracle ecosystem and deliver outstanding performance. We will introduce the system and then step through a use case that illustrates the flow of information across it, highlighting along the way the optimizations that are unique to Oracle.
InfiniBand is a key enabler and an example of Oracle's superior technology. Without InfiniBand — without the super highway that integrates Oracle's engineered solutions — customers will try to squeeze all these capabilities into one box for either a performance or a price advantage, and they will fail at both. With InfiniBand, customers have the right tool optimized for the right job: the value of integrated Oracle solutions is greater than the sum of the parts.
  • Connectors work with Oracle’s engineered systems and also with other Hadoop distributions and Oracle databases (as long as it is a version we support)
  • Parallelism: PQ slaves in the database will read data in parallel. If you have 64 PQ slaves, 64 files will be read in parallel. The number of PQ slaves is limited by the number of location files.
  • When OSCH is invoked with the -createTable option, the external table definition is generated, the external table is created, and the location files are populated. You can examine the location files if you like. Their contents were also displayed on screen, along with the external table definition.
  • Interesting properties: tableName (name of the external table), sourceType, hive.tableName, hive.databaseName
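The slides show moviefact_hive.xml only as a screenshot. As an illustration, a minimal version of such a file might look like the sketch below — this assumes the standard oracle.hadoop.exttab.* property-name prefix used by OSCH, and the Hive table and database names here are hypothetical placeholders, not the exact contents of the lab file:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Name of the external table OSCH creates in the database
       (this table name appears in the lab queries) -->
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>MOVIE_FACT_EXT_TAB_HIVE</value>
  </property>
  <!-- Source is a Hive table rather than raw HDFS files -->
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>hive</value>
  </property>
  <!-- Hypothetical Hive table and database names -->
  <property>
    <name>oracle.hadoop.exttab.hive.tableName</name>
    <value>movie_fact</value>
  </property>
  <property>
    <name>oracle.hadoop.exttab.hive.databaseName</name>
    <value>default</value>
  </property>
</configuration>
```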
  • Let us try some queries on this external table
  • You will see that the external table has two location files, because of the value we specified in the locationFileCount property. You can see that the URIs of the smaller data files have been grouped into one location file. OSCH does this to load balance the reading of data as much as possible. URIs in the location files are read in parallel. You can examine the location files if you like.
  • Interesting properties: tableName (external table that will be created),sourceType, dataPaths, locationFileCount
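For the text-file case, a sketch of what such a configuration might contain, assuming the same standard property names; the external table name and HDFS data path below are hypothetical placeholders:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- External table to be created in the database (hypothetical name) -->
  <property>
    <name>oracle.hadoop.exttab.tableName</name>
    <value>MOVIE_FACT_EXT_TAB_TEXT</value>
  </property>
  <!-- Source is raw delimited text on HDFS -->
  <property>
    <name>oracle.hadoop.exttab.sourceType</name>
    <value>text</value>
  </property>
  <!-- HDFS path(s) of the data files (hypothetical) -->
  <property>
    <name>oracle.hadoop.exttab.dataPaths</name>
    <value>/user/oracle/moviework/data</value>
  </property>
  <!-- Number of location files; this caps the read parallelism -->
  <property>
    <name>oracle.hadoop.exttab.locationFileCount</name>
    <value>2</value>
  </property>
</configuration>
```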
  • How does this perform? The alternative to OSCH is to use fuse-dfs. We are 5 times faster than fuse-dfs, while using 75% less CPU. The test was performed on a BDA (18 Sun x4270 M2 servers, 216 cores, 48 GB memory per server, 864 GB total) and an Exadata X2-8 single instance (8 Intel Xeon X7560 servers, 64 cores, 1 TB memory). The data size used in the CPU usage graph is 0.25 TB.
  • OLH is a MapReduce job that runs on the Hadoop cluster. The job is submitted to the cluster like any MapReduce job. Data is read through input formats, and database table partitions are loaded in parallel by reducer tasks. There are online and offline modes. Online: pre-process and load in the same job. Offline: write out data files on HDFS (text or Oracle Data Pump) for load later. The data pre-processing performs partitioning, sorting, and data conversion on Hadoop.
  • Now let us look at Oracle Loader for Hadoop. In addition to the file containing the configuration parameters, we have a loader map file that describes the columns in the target table we are loading into. If all columns in the table are loaded and the data columns have the default date format, this file is not needed. Here the date format in the data is different from the default, so it is specified in the loader map file.
  • We first create the target table in the database that we want to load data into.
  • The mapreduce.outputformat.class property specifies OCIOutputFormat, which selects the online load option with direct path load. mapred.input.dir specifies the path of the data files, and mapreduce.inputformat.class specifies that the data is in delimited text format.
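These three properties could be sketched as the fragment below. The property names come from the notes above and the input directory from the hands-on exercise; the fully qualified class names are assumptions about OLH's package paths and may differ in your release:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Online load with direct path (full class path assumed) -->
  <property>
    <name>mapreduce.outputformat.class</name>
    <value>oracle.hadoop.loader.lib.output.OCIOutputFormat</value>
  </property>
  <!-- HDFS path of the input data files -->
  <property>
    <name>mapred.input.dir</name>
    <value>/user/oracle/moviedemo/session</value>
  </property>
  <!-- Input data is delimited text (full class path assumed) -->
  <property>
    <name>mapreduce.inputformat.class</name>
    <value>oracle.hadoop.loader.lib.input.DelimitedTextInputFormat</value>
  </property>
</configuration>
```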
  • Loader Map file. Note the specification of the date format.
  • We use 85% less CPU, and are more than ten times as fast. The data size used in the CPU usage graph is 0.25 TB.
  • This is a big deal. We spend significant time and effort keeping up with the versions. This saves you the time needed to make a connector work with the Hadoop distro you are using.
  • The connectors can be used together. Oracle Data Pump files can be created by Oracle Loader for Hadoop, and then accessed or loaded into Oracle Database using Oracle SQL Connector for HDFS. So if the data is not in delimited text files, Oracle Loader for Hadoop can first be used to transform the data into Data Pump files (or delimited text files), which are then loaded or accessed by Oracle SQL Connector for HDFS. This is also a good time to highlight the offline load option of OLH.
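The combined flow above amounts to two configuration changes. A sketch, under the same assumptions as before (property names per OSCH/OLH conventions, the DataPumpOutputFormat class path assumed):

```xml
<!-- Step 1, in the OLH job configuration: offline load, writing
     pre-processed Oracle Data Pump files to HDFS instead of loading
     directly into the database -->
<property>
  <name>mapreduce.outputformat.class</name>
  <value>oracle.hadoop.loader.lib.output.DataPumpOutputFormat</value>
</property>

<!-- Step 2, in the OSCH configuration: point the external table at
     those Data Pump files rather than at text or Hive sources -->
<property>
  <name>oracle.hadoop.exttab.sourceType</name>
  <value>datapump</value>
</property>
```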

Transcript

  • 1. Developing a Successful Big Data Strategy Best Practices for Development Raul Goycoolea S. Solution Architect Manager Oracle Latin America Architecture Team Mexico Developer Day, Apr 2014
  • 2. Copyright © 2012, Oracle and/or its affiliates. All rights reserved. Twitter http://twitter.com/raul_goycoolea Raul Goycoolea Seoane Keep in Touch Facebook http://www.facebook.com/raul.goycoolea Linkedin http://www.linkedin.com/in/raulgoy Blog http://blogs.oracle.com/raulgoy/ Raul Goycoolea S. 16 February 2012
  • 3. Agenda  Introduction  Architecture/Design Pattern  Use Cases
  • 4. Who are you? http://goo.gl/XkwxwM
  • 5. MEDIA/ ENTERTAINMENT Viewers / advertising effectiveness Cross Sell COMMUNICATIONS Location-based advertising EDUCATION & RESEARCH Experiment sensor analysis Retail / CPG Sentiment analysis Hot products Optimized Marketing HEALTH CARE Patient sensors, monitoring, EHRs Quality of care LIFE SCIENCES Clinical trials Genomics HIGH TECHNOLOGY / INDUSTRIAL MFG. Mfg quality Warranty analysis OIL & GAS Drilling exploration sensor analysis FINANCIAL SERVICES Risk & portfolio analysis New products AUTOMOTIVE Auto sensors reporting location, problems Games Adjust to player behavior In-Game Ads LAW ENFORCEMENT & DEFENSE Threat analysis - social media monitoring, photo analysis TRAVEL & TRANSPORTATION Sensor analysis for optimal traffic flows Customer sentiment UTILITIES Smart Meter analysis for network capacity, Sample of Big Data Use Cases Today ON-LINE SERVICES / SOCIAL MEDIA People & career matching Web-site optimization What is the main difference in this data? Volume, Velocity, Variety These Characteristics Challenge Existing Architectures
  • 6. Make Better Decisions Using Big Data Big Data In Action ANALYZE DECIDE ACQUIRE ORGANIZE
  • 7. Analyze all your data, at once Big Data in Action ACQUIRE ORGANIZE ANALYZE DECIDE
  • 8. Strategic Transformations FROM: Fragmented View, Historical Reporting, Results TO: Unified View, Real-time predictive, Insight-driven optimization
  • 9. Traditional Data Sources - Reporting
  • 10. New Data Sources - Predicting
  • 11. Big Data Analysis Characteristics • Integrate – Traditional and New data • Explore – More data, More sources • Discover – Plan, Visualize, Model, Act
  • 12. Big Data Analysis In Retail: The Problem Fashion retailer sees flat and declining sales No apparent differences by geography or standard demographics New marketing program didn’t help
  • 13. Step 1: New Segmentation • Analyze weblog files – Response rates – Frequency and duration of visits – Shopping cart activity – Devices used to access • Cross reference with demographics – Affinity program – Online profiles • New insight: younger, affluent women are not buying
  • 14. Step 2: Sentiment Analysis • Analyze all comments – Social media, forums • Cross reference with customer information – Affinity programs – Online activity – Sales records • New insight: new segment expresses “out of stock”
  • 15. Step 3: Inventory Analysis • Analyze promoted products – No stocking problems • Cross-reference with all shopper activities – Online shopping cart activity – Affinity program – Shopper location information – “Out of stock” comments • Key insight: matching accessories are out of stock
  • 16. Big Data Analysis In Retail: The Answer Young women with higher disposable income (and smart phones) did not buy a designer sweater when the matching sleeveless top was out of stock.
  • 17. Exadata Exalytics Oracle Big Data Platform ACQUIRE ORGANIZE ANALYZE DECIDE Big Data Appliance
  • 18. Oracle Exadata Database Machine • Fastest Data Warehouse & OLTP • Best Cost/Performance Data Warehouse & OLTP • Optimized Hardware (per rack) • Processor: up to 128 Intel Cores and 2 TB DRAM • Network: 880 Gb/Sec Throughput • Storage: 5 TB Flash and up to 336 TB Disk • Software Breakthroughs • Exadata Smart Storage Grid • Smart Flash Cache • Hybrid Columnar Compression • Parallel Scale-Out Database and Storage • Scales from ¼ Rack to 8 Full Racks Data Warehousing, Transaction Processing, Consolidation
  • 19. Oracle In-Database Analytics Platform XML Relational OLAP Spatial Data Layer RDF Media Parallel Processing Engine Oracle R Enterprise Oracle Data Mining Text and Search Spatial Analytics SQL Analytics Oracle MapReduce
  • 20. Oracle In-Database Analytics New: Oracle Advanced Analytics Statistical Data Mining Text Graph Spatial Semantic
  • 21. Oracle Exalytics In-Memory Machine First engineered system for analytics Visual Analysis without limits Smarter analytic applications
  • 22. End-user Experience with Exalytics Speed of Thought Interactive Analysis Interactive Analysis Free Exploration Dense Visualizations Fully Mobile
  • 23. Over 80 Analytic Applications Run on Exalytics No application changes required Financials, HR Sales, marketing Planning, forecasting Many industries
  • 24. Analyzing Big Data • Comprehensive • Enterprise ready • Engineered to work together • Optimized for extreme analytics
  • 25. Oracle Exadata Oracle Exalytics Oracle Big Data Platform Stream Acquire Organize Discover & Analyze Oracle Big Data Appliance Oracle Big Data Connectors Optimized for Analytics & In-Memory Workloads “System of Record” Optimized for DW/OLTP Optimized for Hadoop, R, and NoSQL Processing Oracle Enterprise Performance Management Oracle Business Intelligence Applications Oracle Business Intelligence Tools Oracle Endeca Information Discovery Hadoop Open Source R Applications Oracle NoSQL Database Oracle Big Data Connectors Oracle Data Integrator Data Warehouse Oracle Advanced Analytics Oracle Database
  • 26. Use Case Introduction  Oracle MoviePlex is an on-line movie streaming company  Like many other on-line stores, they needed a cost effective approach to tackle their “big data” challenges  They recently implemented Oracle’s Big Data Platform to better manage their business, identify key opportunities and enhance customer satisfaction
  • 27. Common Big Data Challenge  Applications are generating massive volumes of unstructured data that describe user behavior and application performance  Today, most companies are unable to fully capitalize on this potentially valuable information due to cost and complexity  How do you capitalize on this raw data to gain better insights into your customers, enhance their user experience and ultimately improve profitability? {"custId":1185972,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:07","recommended":null,"activity":8} {"custId":1354924,"movieId":1948,"genreId":9,"time":"2012-07-01:00:00:22","recommended":"N","activity":7} {"custId":1083711,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:26","recommended":null,"activity":9} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:32","recommended":"Y","activity":7} {"custId":1010220,"movieId":11547,"genreId":44,"time":"2012-07-01:00:00:42","recommended":"Y","activity":6} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:43","recommended":null,"activity":8} {"custId":1253676,"movieId":null,"genreId":null,"time":"2012-07-01:00:00:50","recommended":null,"activity":9} {"custId":1351777,"movieId":608,"genreId":6,"time":"2012-07-01:00:01:03","recommended":"N","activity":7} {"custId":1143971,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:07","recommended":null,"activity":9} {"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:01:18","recommended":"Y","activity":7} {"custId":1067283,"movieId":1124,"genreId":9,"time":"2012-07-01:00:01:26","recommended":"Y","activity":7} {"custId":1126174,"movieId":16309,"genreId":9,"time":"2012-07-01:00:01:35","recommended":"N","activity":7} {"custId":1234182,"movieId":11547,"genreId":44,"time":"2012-07-01:00:01:39","recommended":"Y","activity":7} 
{"custId":1067283,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:55","recommended":null,"activity":9} {"custId":1377537,"movieId":null,"genreId":null,"time":"2012-07-01:00:01:58","recommended":null,"activity":9} {"custId":1347836,"movieId":null,"genreId":null,"time":"2012-07-01:00:02:03","recommended":null,"activity":8} {"custId":1137285,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:39","recommended":null,"activity":8} {"custId":1354924,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:51","recommended":null,"activity":9} {"custId":1036191,"movieId":null,"genreId":null,"time":"2012-07-01:00:03:55","recommended":null,"activity":8} {"custId":1143971,"movieId":1017161,"genreId":44,"time":"2012-07-01:00:04:00","recommended":"Y","activity":7} {"custId":1363545,"movieId":27205,"genreId":9,"time":"2012-07-01:00:04:03","recommended":"Y","activity":5} {"custId":1273464,"movieId":null,"genreId":null,"time":"2012-07-01:00:04:39","recommended":null,"activity":9} {"custId":1346299,"movieId":424,"genreId":1,"time":"2012-07-01:00:05:02","recommended":"Y","activity":4}
  • 28. Common Big Data Challenge (repeats the bullets and sample log records of slide 27) How can you get answers to….
  • 29. Derive Value from Big Data  Make the right movie offers at the right time?  Better understand the viewing trends of various customer segments?  Optimize marketing spend by targeting customers with optimal promotional offers?  Minimize infrastructure spend by understanding bandwidth usage over time?  Prepare to answer questions that you haven’t thought of yet! How can you ….
  • 30. Oracle Exadata Oracle Big Data Appliance MoviePlex Architecture Application Log Log of all activity on site Capture activity nec. for MoviePlex site Streamed into HDFS using Flume Load Recommendations Customer Profile (e.g. recommended movies) Oracle NoSQL DB HDFS Map Reduce ORCH - CF Recs. Map Reduce Hive - Activities Map Reduce Pig - Sessionize Clustering/Market Basket Oracle Advanced Analytics Oracle Exalytics Endeca Information Discovery Oracle Business Intelligence EE “Mood” Recommendations Load Session & Activity Data Oracle Big Data Connectors
  • 31. Acknowledgements  Movie information courtesy of The Internet Movie Database (http://www.imdb.com). Used with permission.  Movie images provided by the TMDb API but is not endorsed or certified by TMDb  All customer information and session details are completely fictitious
  • 32. ANALYZE DECIDE ACQUIRE ORGANIZE DISCOVER VISUALIZE Oracle’s Big Data Platform STREAM OPERATIONALIZE
  • 33. Program Agenda  Oracle SQL Connector for HDFS – Brief Overview – Hands-on Exercises  Oracle Loader for Hadoop – Brief Overview – Hands-on Exercises  (Optional exercise): Use both connectors together
  • 34. Loading and Accessing Data from Hadoop (diagram: log files flowing through MapReduce jobs into Oracle Database via Oracle SQL Connector for HDFS and Oracle Loader for Hadoop)
  • 35. Hadoop Oracle Database Oracle’s Big Data Platform Oracle Big Data Connectors
  • 36. Oracle’s Big Data Platform ACQUIRE ORGANIZE ANALYZE Oracle Big Data Connectors Hadoop Big Data Connectors work with • Oracle’s engineered systems, and • Other hardware
  • 37. Oracle SQL Connector for HDFS
  • 38. Oracle SQL access to Hive tables and HDFS files Automated generation of external table to access the data Query data in-place or load Access or load data in parallel Oracle SQL Connector for HDFS High Performance Access and Load from Hadoop with Oracle SQL
  • 39. Part 1a: Reading Hive Tables with Oracle SQL Connector for HDFS  cd /home/oracle/movie/moviework/osch – This directory contains the scripts genloc_moviefact_hive.sh, moviefact_hive.xml  Execute the script  sh genloc_moviefact_hive.sh – (the password is: welcome1)
  • 40. Part 1a  The script sh genloc_moviefact_hive.sh hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf /home/oracle/movie/moviework/osch/moviefact_hive.xml -createTable
  • 41. Part 1  Examine the Hadoop configuration properties – more moviefact_hive.xml
  • 42. Part 1a  Query the table – sqlplus moviework/oracle SQL> select count(*) from movie_fact_ext_tab_hive; SQL> select custid from movie_fact_ext_tab_hive where rownum < 10; SQL> select custid, title from movie_fact_ext_tab_hive p, movie q where p.movieid = q.movieid and rownum < 10;
  • 43. Installing Oracle SQL Connector for HDFS (diagram: OSCH installed on both the Hadoop cluster, with Hadoop and Hive clients, and the Oracle Database system, where the external table is defined)
  • 44. Part 1b: Reading Text Files on HDFS with Oracle SQL Connector for HDFS  cd /home/oracle/movie/moviework/osch – This directory contains the scripts genloc_moviefact_text.sh, moviefact_text.xml  Execute the script  sh genloc_moviefact_text.sh – (the password is: welcome1)
  • 45. Part 1b  The script genloc_moviefact_text.sh hadoop jar $OSCH_HOME/jlib/orahdfs.jar oracle.hadoop.exttab.ExternalTable -conf /home/oracle/movie/moviework/osch/moviefact_text.xml -createTable
  • 46. Part 1b  Examine the Hadoop configuration properties – more moviefact_text.xml
  • 47. Performance Comparison (charts: load rate in TB/hour and CPU seconds used per GB, Fuse-DFS vs Oracle Direct Connector for HDFS)
  • 48. Oracle Loader for Hadoop
  • 49. Oracle Loader for Hadoop High Performance Loader Convert data into Oracle ready data types on Hadoop Offload data pre-processing from the database server to Hadoop Load pre-processed data online or offline Automatically handle input data skew Works with a range of input formats Connect to the database from reducer nodes, load into database partitions in parallel Partition, sort, and convert into Oracle data types on Hadoop
  • 50. Part 2: Oracle Loader for Hadoop  Examine the data files on HDFS – hadoop fs -ls /user/oracle/moviedemo/session
  • 51. Part 2: Oracle Loader for Hadoop  cd /home/oracle/movie/moviework/olh – This directory contains all the necessary scripts moviesession.sql, moviesession.xml, loaderMap_moviesession.xml, runolh_session.sh
  • 52. Part 2  Create the table data will be loaded into – sqlplus moviedemo/welcome1 SQL> @moviesession.sql
  • 53. Part 2  Submit the Oracle Loader for Hadoop MapReduce job – sh runolh_session.sh hadoop jar ${OLH_HOME}/jlib/oraloader.jar oracle.hadoop.loader.OraLoader -conf /home/oracle/movie/moviework/olh/moviesession.xml
  • 54. Part 2  Examine the Hadoop configuration properties – more moviesession.xml
  • 55. Part 2  Examine the loaderMap file – more loaderMap_moviesession.xml
  • 56. Installing Oracle Loader for Hadoop (diagram: OLH and the Hive client on the Hadoop cluster, loading into the target table on the Oracle Database system)
  • 57. Performance Comparison (charts: load rate in TB/hour and CPU seconds used per GB, a comparable third party product vs Oracle Loader for Hadoop)
  • 58. Versions  Certified Versions – Oracle Database 11.2.0.2 and higher – Hadoop distributions  CDH3, CDH4 (versions of Cloudera’s Distribution including Apache Hadoop)  Apache Hadoop 1.0.x, 1.1.1  Should work with Hadoop distros based on certified Apache Hadoop versions
  • 59. Oracle Loader for Hadoop and Oracle SQL Connector for HDFS Offline load: data is pre-processed and written in Oracle Data Pump format in HDFS. The Oracle Data Pump files in HDFS are then queried (and loaded if necessary) with Oracle SQL Connector for HDFS.
  • 60. Thank You! http://goo.gl/XkwxwM
  • 61. Twitter http://twitter.com/raul_goycoolea Raul Goycoolea Seoane Keep in Touch Facebook http://www.facebook.com/raul.goycoolea Linkedin http://www.linkedin.com/in/raulgoy Blog http://blogs.oracle.com/raulgoy/ Raul Goycoolea S. 16 February 2012