Hadoop, you may want to either access that data from Oracle Database by issuing SQL against HDFS files or move the data into Oracle tables. Let's start with the latter: moving the data into Oracle tables. Oracle Loader for Hadoop (OLH) is a high-performance loader for fast movement of data from any Hadoop cluster into Oracle Database tables. Like all other parts of the Big Data Connectors, it is available on any Hadoop cluster based on Apache Hadoop in addition to the Big Data Appliance. If you want to take the results and perform additional analysis using advanced BI and data warehousing technologies, or incorporate them in other applications, OLH is fast and reduces the processing load on the database server. It runs as a MapReduce job and uses the Hadoop servers' processing resources to sample, sort and pre-partition the data based on the target database metadata. It can automatically take input from delimited text files (CSV) or Hive tables, or you can write your own input format. OLH can either load the results directly into the database, using the parallel direct path load interface or JDBC, or create Oracle-formatted Data Pump files. OLH has built-in load balancing across the reducer nodes that prevents performance from degrading due to unbalanced loads.
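To make the "sample, sort and pre-partition" idea concrete, here is a minimal Python sketch of sample-based range partitioning, a common way to balance work across reducers. It is not OLH's actual implementation, which is not described in these notes; the function names `choose_boundaries` and `assign_reducer` and the synthetic skewed keys are assumptions made purely for illustration.

```python
import bisect
import random


def choose_boundaries(sample_keys, num_reducers):
    """Pick num_reducers - 1 split points so that each key range
    covers a similar share of the sampled keys."""
    ordered = sorted(sample_keys)
    step = len(ordered) / num_reducers
    return [ordered[int(step * i)] for i in range(1, num_reducers)]


def assign_reducer(key, boundaries):
    """Return the index of the reducer whose key range contains key."""
    return bisect.bisect_right(boundaries, key)


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical skewed input: most keys cluster at low values.
    keys = [int(random.expovariate(1 / 50)) for _ in range(10_000)]
    # Sample a small fraction of the input instead of sorting everything.
    sample = random.sample(keys, 500)
    boundaries = choose_boundaries(sample, num_reducers=4)

    loads = [0] * 4
    for key in keys:
        loads[assign_reducer(key, boundaries)] += 1
    print("boundaries:", boundaries)
    print("records per reducer:", loads)
```

Because the split points come from a sample of the real key distribution rather than from fixed, evenly spaced ranges, a skewed input still yields roughly comparable record counts per reducer.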
Oracle Direct Connector for HDFS makes it possible to access data in HDFS on the Hadoop cluster from Oracle Database using SQL. It provides a virtual table view of the HDFS files and allows parallel query access to the data using the standard Oracle Database external table mechanism. If you are using BDA and Exadata, the connectivity runs over the InfiniBand network fabric, so the database access to HDFS, in the very scientific words of the development manager, “flies”. If you need to import the data in HDFS into Oracle, the Direct Connector requires neither a file copy nor Linux FUSE. Instead it uses the native Oracle Loader interface.
If you already use Oracle Data Integrator (or are familiar with this kind of tool and want to use ODI), then it can simplify the MapReduce process. As long as you can describe the transformation that you need to perform on the data, ODI can generate the MapReduce code for you and run that process. It can even invoke Oracle Loader for Hadoop at the end of the cycle. So if you are not an expert in Java, parallel algorithms and the Hadoop framework, there is still a way to use it all to organize your code. Note: ODI generates SQL code, which is then passed into Hive (a component of many Hadoop distributions), which generates the actual Java MapReduce code. You need Big Data Connectors, specifically the ODI Application Adapter for Hadoop, to make all this work.
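The idea of describing a transformation and letting a tool emit the SQL that Hive compiles into MapReduce can be sketched in a few lines of Python. This is only an illustration of the declarative-to-HiveQL step, not ODI's code generator; the `build_hiveql` function, the shape of the `spec` dictionary, and the table and column names are all hypothetical.

```python
def build_hiveql(spec):
    """Render a small declarative transformation spec as a HiveQL statement."""
    # Grouping columns come first, followed by each aggregate expression.
    select_parts = list(spec["group_by"])
    select_parts += [
        f"{func}({column}) AS {alias}"
        for alias, (func, column) in spec["aggregates"].items()
    ]

    lines = [
        f"INSERT OVERWRITE TABLE {spec['target']}",
        f"SELECT {', '.join(select_parts)}",
        f"FROM {spec['source']}",
    ]
    if spec.get("where"):
        lines.append(f"WHERE {spec['where']}")
    if spec["group_by"]:
        lines.append(f"GROUP BY {', '.join(spec['group_by'])}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical transformation: total sales amount per region.
    spec = {
        "source": "raw_sales",
        "target": "sales_by_region",
        "group_by": ["region"],
        "aggregates": {"total_amount": ("SUM", "amount")},
        "where": "amount > 0",
    }
    print(build_hiveql(spec))
```

The person writing the spec never touches Java or the MapReduce API; Hive takes the generated statement and plans the map and reduce stages itself.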
Our view of the BI landscape is that there are fundamentally two dominant types of problems. On one hand there are questions where we can define up front both the process and the data required to answer them. What are sales forecasts by region? What is my performance relative to expectation? On the other hand are questions where either the process or the data cannot be defined ahead of time; these questions are open-ended by nature. What customers should I target? Why are my sales going down? It's also interesting to point out that these questions are far more transient than the other type, and this follows from their open-ended nature. Each question leads to new questions. The interaction model for the former is more like “looking it up”; it's a report or dashboard. On the other hand, when you don't know exactly what you need or how to ask for it, the necessary interaction model is exploration and discovery: a dialog with the data. <transition> It also follows that, as a matter of practice, some data is modeled and other data is not. We take modeled to mean that there is a single, overarching semantic model. Of course, modeling costs time and money, so we generally only make the investment in cases where the expected return on that investment is large enough to justify the effort. The cost of storing un-modeled data has continued to drop but, importantly, with the popularization of Hadoop, the promise of deriving value from un-modeled data is rising rapidly. The result is an explosion in the capture of un-modeled data. Through this view of the BI landscape we can see how Traditional Business Intelligence and Data Discovery fit in. <transition> Traditional Business Intelligence is purpose-built and very strong for known questions and modeled data.
Friction arises when organizations attempt to use these products for new and unpredictable questions, which require similarly new and unpredictable data models to meet the need. <transition> In the other space is the emerging market category of data discovery, where the goal is to provide everyday business users with fast answers to new questions so they can make better, more informed business decisions. Data discovery tools follow several key market trends. First, the growth in data volume, diversity, and complexity. Not much to say here that hasn't already been said. Organizations today are beginning to understand the value inherent in this information and are looking for tools that can unlock that value to give them competitive advantage. And more and more users need to access and understand this information. Second, the consumerization of business software. When IT is unable to deliver, business users are increasingly willing to go outside of IT in order to meet their own needs. They are empowered with their choice of tools, and with expectations formed in the consumer world, their demands for amazing user experiences have never been higher.
How do we do it? Endeca Information Discovery provides a full-featured platform for creating discovery applications that provide access to all kinds of information. Drilling into the architecture, we accomplish this with three tiers.
Notes: This slide is a logical representation of the scope of a Big Data solution. It provides the basis for describing data flows in each stage of the Big Data process in the following slides. The scope of a Big Data solution includes taking actions and decisions on the results of analysis, hence integration with Applications. Real-time event detection can be part of a Big Data solution. This is an important point to draw out because IBM claims its Streams capability is a USP; see the book Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data.
Transcript of "2012 10 bigdata_overview"
Big Data
Jean-Pierre Dijcks
Agenda
• Big Data
• Strategy
• Technology
• Use Cases
Big Data
• React to an Event
• Proactively Change Outcomes
“Technology presents the opportunity to transform business”*
Mark Hurd, President, Oracle
* Oracle Profit Magazine, Volume 17, Number 1
Big Data's Key Ingredient
“Improvement merely lets you hit the numbers. Creativity is what transforms.”*
Ron Johnson, CEO, JCPenney
• Big Data transforms our business: 5%
• Big Data improves our business: 20%
• What is Big Data?: 75%
* Fortune Magazine, Vol. 165, No. 4
Big Data Extends the Breadth and Speed of Data
Big Data: Decisions based on all your data
• Video and Images
• Documents
• Social Data
• Machine-Generated Data
Information Architectures Today: Decisions based on database data
• Transactions
Big Data Extends the Depth of Analytics
• Query and Reporting
• Data Mining
• Statistics
• Graph Analytics
• Spatial Analytics
• Text Analytics
Big Data Defined
Big Data: Techniques and Technologies that Enable Enterprises to Effectively and Economically Analyze All of their Data
Big Data Appliance
Hardware:
• 288 CPU cores with 1152 GB RAM
• 648 TB of raw disk storage
• 40 Gb/s InfiniBand
Integrated Software:
• Oracle Linux
• Oracle Java VM
• Cloudera Distribution of Apache Hadoop (CDH)
• Cloudera Manager
• Open-source distribution of R
• NoSQL Database Community Edition
All integrated software (except NoSQL DB CE) is supported as part of Premier Support for Systems and Premier Support for Operating Systems
Oracle Big Data Appliance
• File System Mount: FUSE-DFS
• UI Framework: HUE
• SDK: HUE SDK
• Workflow: Apache Oozie
• Scheduling: Apache Oozie
• Metadata: Apache Hive
• Languages / Compilers: Apache Pig, Apache Hive, Apache Mahout
• Data Integration: Apache Flume, Apache Sqoop
• Fast Read/Write Access: Apache HBase
• HDFS, MapReduce
• Coordination: Apache ZooKeeper
Why Cloudera?
• Includes Open Source Apache Hadoop
  – Fast evolution in critical features
  – Proven at very large scale
• Managed Distribution
  – Components certified to work together in regular updates
  – Cloudera Manager provides Management GUI
• Most popular distribution in the market
Oracle and Cloudera
• All Cloudera software pre-installed and pre-configured on BDA
  – Engineered with Cloudera
• All Cloudera assets included
  – Single Oracle Product SKU for HW & SW
  – Single Oracle Support SKU for HW & SW (life of the machine)
• Oracle is the single point of contact for the solution
Price comparison

Oracle Big Data Appliance
                        Year 1     Year 2    Year 3    Total
BDA Cost                $450,000
Support Cost            $54,000    $54,000   $54,000
On-site Installation    $14,150
Total                   $518,150   $54,000   $54,000   $626,150

“Build-Your-Own” – HP hardware and Cloudera
                                Year 1     Year 2    Year 3    Total
Servers and switches            $428,220
Support Cost                    $136,233   $72,000   $72,000
Installation & configuration    not included
Total                           $564,453   $72,000   $72,000   $708,453

Full details at https://blogs.oracle.com/datawarehousing/entry/price_comparison_for_big_data
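The totals on this slide can be re-derived from the individual line items. The short Python check below uses only the figures shown on the slide; the dictionary layout and the `three_year_total` helper are illustrative assumptions, and the build-your-own installation cost is entered as 0 only because the slide lists it as "not included".

```python
# Line items as shown on the price comparison slide (US dollars).
bda = {
    "purchase": 450_000,                          # BDA Cost
    "installation": 14_150,                       # On-site Installation
    "support_per_year": [54_000, 54_000, 54_000],
}
build_your_own = {
    "purchase": 428_220,                          # Servers and switches
    "installation": 0,                            # "not included" on the slide
    "support_per_year": [136_233, 72_000, 72_000],
}


def three_year_total(costs):
    """Sum purchase, installation and three years of support."""
    return costs["purchase"] + costs["installation"] + sum(costs["support_per_year"])


if __name__ == "__main__":
    bda_total = three_year_total(bda)
    byo_total = three_year_total(build_your_own)
    print(f"Big Data Appliance, 3 years: ${bda_total:,}")
    print(f"Build-your-own, 3 years:     ${byo_total:,}")
    print(f"Difference:                  ${byo_total - bda_total:,}")
```

Both computed totals match the slide ($626,150 and $708,453), a three-year difference of $82,303 before any installation cost is added to the build-your-own side.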
Oracle NoSQL Database
A distributed, scalable key-value database
• Simple Data Model
  • Key-value pair with major+sub-key paradigm
  • Read/insert/update/delete operations
• Scalability
  • Dynamic data partitioning and distribution
  • Optimized data access via intelligent driver
• High availability
  • One or more replicas
  • Disaster recovery through location of replicas
  • Resilient to partition master failures
  • No single point of failure
• Transparent load balancing
  • Reads from master or replicas
  • Driver is network topology & latency aware
[Diagram labels: Application, NoSQLDB Driver, Storage Nodes, Data Center A, Data Center B]
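The major+sub-key data model and hash-based partitioning can be illustrated with a toy in-memory store in Python. This is a conceptual sketch only and does not use or mirror the Oracle NoSQL Database driver API; the `TinyKVStore` class, its method names, and the choice to hash only the major key when picking a partition are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


class TinyKVStore:
    """Toy key-value store: each key has a major part and a sub-key part.
    Records are placed in a partition by hashing the major key only, so
    all sub-keys of one major key end up together (assumption of this sketch)."""

    def __init__(self, num_partitions=4):
        self.num_partitions = num_partitions
        self.partitions = [defaultdict(dict) for _ in range(num_partitions)]

    def _partition_for(self, major):
        digest = hashlib.md5("/".join(major).encode("utf-8")).digest()
        return int.from_bytes(digest[:4], "big") % self.num_partitions

    def put(self, major, minor, value):
        """Insert or update one value."""
        self.partitions[self._partition_for(major)][major][minor] = value

    def get(self, major, minor):
        """Read one value, or None if it does not exist."""
        records = self.partitions[self._partition_for(major)].get(major, {})
        return records.get(minor)

    def multi_get(self, major):
        """Read every sub-key stored under one major key."""
        return dict(self.partitions[self._partition_for(major)].get(major, {}))

    def delete(self, major, minor):
        """Remove one value if present."""
        records = self.partitions[self._partition_for(major)].get(major, {})
        records.pop(minor, None)


if __name__ == "__main__":
    store = TinyKVStore()
    store.put(("user", "42"), ("profile", "name"), "Ann")
    store.put(("user", "42"), ("profile", "email"), "ann@example.com")
    print(store.multi_get(("user", "42")))
```

Grouping by major key means a single partition can answer a request for everything about one entity, while different major keys spread across partitions for scalability.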
Big Data Connectors
Optimized integration of Hadoop with Oracle Database and Oracle Exadata
• Oracle Loader for Hadoop
• Oracle Direct Connector for Hadoop Distributed File System (HDFS)
• Oracle Data Integrator Application Adapter for Hadoop
• Oracle R Connector for Hadoop
• Does not require Big Data Appliance – can be licensed for Hadoop running on non-Oracle hardware
Oracle Loader for Hadoop
Use The Cluster
• Last stage in MapReduce workflow
• Partitioned and non-partitioned tables
• Online and offline loads
[Diagram labels: Oracle Loader for Hadoop, MAP, SHUFFLE/SORT, REDUCE]
Oracle Direct Connector for HDFS
Direct Access from Oracle Database
• SQL access to HDFS
• External table view
• Data query or import
[Diagram labels: Oracle Database, SQL Query, External Table, DCH, InfiniBand, HDFS Client, HDFS]
Oracle Data Integrator
Simplifying MapReduce
• Automatically generates MapReduce code
• Manages the process
• Loads into Data Warehouse
[Diagram labels: Oracle Data Integrator, Oracle Loader for Hadoop]
What is Data Discovery? Simplified
Quickly explore all relevant data
• Structured, semi-structured, unstructured
• Relationships undefined or unknown
• No pre-defined model required
• Messy data
• Rapid, iterative change
• Beyond the data warehouse
• Advanced search
• Faceted navigation
• Analytics
Business Intelligence and Data Discovery
Complementary Solutions, Integrated Business Processes
• Known & Clearly Defined Questions: Who, What, When?
• Uncertain or Open-Ended Questions: Why, How, What Else?
• Un-modeled Data (Diverse and Changing Models and KPIs): Data Discovery, Fast Answers to New Questions
• Modeled Data (Conforms to a Single Model): Business Intelligence, Proven Answers to Known Questions
• Insights yield mature models
• New questions require new data, exploration
Oracle Endeca Information Discovery
A platform for data discovery applications across the enterprise
Endeca Information Discovery (EID) helps organizations quickly explore all relevant data
• Combine structured & unstructured data from disparate systems
• Rapidly assemble easy to use analysis applications
• Automatically organize information for search, discovery & analysis