Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse


Published in: Technology, Business
    Twitter: @DanGraham_
  • Key Message: A good business model is enormously bigger than a star schema or snowflake schema. It's thousands of tables.
    A logical data model is concerned with providing a representation of data throughout the entire corporation. The LDM is focused on entities (named things), attributes (details about the named thing) and relationships (mutual keys and table layouts).
    Each subject area in the LDM contains 20-200 tables once it is translated into a physical data model. Each subject area provides the plan for integrating numerous data sources into a consistent representation of the business itself. Since subject areas are so grand in scope, all the tables may not be implemented and populated initially. Instead, the Professional Services experts focus on loading data that will produce results quickly and help solve customer pains.
    A Logical Data Model
    Helps establish a common understanding of business data elements and requirements
    Provides foundation for designing a database
    Facilitates data re-use and sharing
    Logical Models are about relationships
    Independent of function
    Independent of technical limits
    Physical models are about technology functions
    Data Management
  • Late binding’s advantage is that different bindings can be applied as the discovery process uncovers the best way to analyze the data. Say a street address is stored in a CLOB. You can look at the data from a zip code perspective and then from a city/street perspective.
    The short answer is early binding means mapping record layouts at compile time and late binding maps record layouts at runtime.  Late binding is anytime the program must locate the data within a record at runtime as opposed to knowing the record layout when the program is compiled and the data is first stored in a repository.
    Late binding is the process of figuring out the data format and physical location in memory at runtime. That is, the program does not know how data is stored on disk until the data is read into memory. At that point, the program uses a function (a subroutine) to extract the data from the record or file. In Java programming this is commonly done with a SerDe or a method (two different kinds of function).
    Teradata and Hadoop use similar approaches to solving the late binding task.  If a block of data arrives in memory without a known structure, both application environments call a subroutine to parse out the relevant fields in the block.  Hadoop uses SerDes or Java “methods”.  Teradata uses JSONPath operators.  They all do the same thing.
    One difference between Teradata and Hadoop is that Hadoop only really does late binding on flat files.  When Hive is used, the data is “data typed” meaning the data is expected to be stored in rows and columns, just like a relational database. 
    JSON and XML tools extract the data at runtime, parsing out the meaningful data while the query is running and placing the result in virtual columns created only for the duration of the query. This is in contrast to parsing out data and placing it in columns during the normal ETL processing cycle. It is late binding in the sense that data is inside an object (CLOB or VARCHAR) and is parsed out at runtime into a format usable by the query. There is no schema for the CLOB or VARCHAR, no fixed format that can be relied on without the parsing.
    Table operators have the ability to define an input schema and an output schema at query runtime. Generally the input data is existing relational tables, but it can include foreign data structures such as Oracle tables or flat files. When foreign tables or files are inputs, a dynamic internal schema is generated for the purpose of reading data. This is a form of late binding. The output table format is always dynamically defined, and is also a form of late binding since it does not involve a predefined schema. This is not what most people consider late binding, but technically it is identical.
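    The difference between the two bindings can be sketched in a few lines of Python. This is only an illustration, not Teradata or Hadoop code: the fixed binary layout, the helper names read_early and read_late, and the sample address are all assumptions made up for the example.

```python
import json
import struct

# Early binding: the record layout is fixed when the program is written.
# Hypothetical layout: 4-byte little-endian product id, then a 5-byte ZIP code.
EARLY_LAYOUT = struct.Struct("<i5s")

def read_early(record):
    """Unpack a record whose field offsets are known in advance."""
    prod_id, zip_code = EARLY_LAYOUT.unpack(record)
    return {"prod_id": prod_id, "zip": zip_code.decode("ascii")}

# Late binding: the record is an opaque text blob and each field is
# located by parsing at read time, following a path chosen by the query.
def read_late(blob, path):
    """Parse the blob and walk down the given list of keys."""
    node = json.loads(blob)
    for key in path:
        node = node[key]
    return node

# Hypothetical address stored as one blob (think CLOB or VARCHAR).
blob = '{"address": {"street": "1 Main St", "city": "Springfield", "zip": "12345"}}'

# The same stored blob viewed from two perspectives, decided at runtime.
by_zip = read_late(blob, ["address", "zip"])
by_city_street = (read_late(blob, ["address", "city"]),
                  read_late(blob, ["address", "street"]))
```

    The early-bound reader breaks if the layout changes, while the late-bound reader pays a parsing cost on every read in exchange for that flexibility.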

  • What items do we need to recall based on the quality issue on 6/16 with product #96?

    CAST looks at the JSON data type and formats it as a timestamp.
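    For readers without a Teradata system at hand, the logic of the recall query can be approximated in Python. This is a sketch under assumptions: the first sample row mirrors the result shown on the slide, the other two rows are invented, and recall_candidates is a hypothetical helper, not a Teradata function.

```python
import json
from datetime import datetime

# One JSON document per row, as it might sit in the "box" column of mfgTable.
# Only the first document comes from the slide; the others are made up.
rows = [
    '{"MFG_Line": {"Product": {"Color": "Blue", "Size": "Small", "Prod_ID": 96, "Create_Time": "2013-06-17 20:07:27"}}}',
    '{"MFG_Line": {"Product": {"Color": "Red", "Size": "Large", "Prod_ID": 96, "Create_Time": "2013-06-15 08:00:00"}}}',
    '{"MFG_Line": {"Product": {"Color": "Green", "Size": "Small", "Prod_ID": 12, "Create_Time": "2013-06-18 09:30:00"}}}',
]

CUTOFF = datetime(2013, 6, 16)

def recall_candidates(json_rows, prod_id, cutoff):
    """Return (Color, Size, Prod_ID, Create_Time) for matching products."""
    hits = []
    for text in json_rows:
        product = json.loads(text)["MFG_Line"]["Product"]
        # Equivalent of CAST(... AS TIMESTAMP): the value is stored as text.
        created = datetime.strptime(product["Create_Time"], "%Y-%m-%d %H:%M:%S")
        if created >= cutoff and product["Prod_ID"] == prod_id:
            hits.append((product["Color"], product["Size"],
                         product["Prod_ID"], product["Create_Time"]))
    return hits

result = recall_candidates(rows, 96, CUTOFF)
```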
  • Schema-on-read is a Java and C++ term that means the data is not organized and validated at the time it is first written to disk. With schema-on-read, the data is read from disk and, at runtime, a schema layout is applied to most of the data, while some data must be located via parsing. This is popular in the process-oriented languages because it allows for flexibility in interpreting data, especially data that has not been explored before.
    Schema-on-read has several advantages:
    It allows querying data without having to fully understand it first. This is discovery and exploration.
    Schema on read makes for a very fast initial load, since the data does not have to be transformed or validated.
    It is more flexible, too: consider having two schemas for the same underlying data, depending on the analysis being performed.
    Some data cannot be defined in a schema since the name-value pairs can be in different locations in the JSON object.

    Why not just materialize all possible JSON columns immediately? There are a few reasons:
    In some cases, the result would be unwieldy and sparse. You could end up with 50M nulls or worse.
    You don’t always know about new data arrivals weeks or months in advance.
    We want to preserve the agile nature of JSON, with new data flowing into the system without extensive modeling and governance.
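    A small Python sketch may make both points concrete: two schemas applied to the same stored documents at read time, and the nulls that appear if every JSON key is materialized as a column. The documents, field names and helper functions here are hypothetical, chosen only for illustration.

```python
import json

# Raw documents loaded as-is, with no transformation or validation.
raw = [
    '{"id": 1, "address": "1 Main St, Springfield, 12345"}',
    '{"id": 2, "address": "9 Elm Ave, Shelbyville, 54321", "gift_wrap": true}',
    '{"id": 3, "address": "4 Oak Rd, Springfield, 12345", "coupon": "SAVE10"}',
]
docs = [json.loads(r) for r in raw]

def zip_schema(doc):
    """First schema: view each document by ZIP code."""
    return {"id": doc["id"], "zip": doc["address"].rsplit(", ", 1)[1]}

def city_street_schema(doc):
    """Second schema: view the same document by city and street."""
    street, city, _zip = doc["address"].split(", ")
    return {"id": doc["id"], "city": city, "street": street}

by_zip = [zip_schema(d) for d in docs]
by_city = [city_street_schema(d) for d in docs]

# Materializing every key as a column yields a sparse table instead.
all_keys = sorted({key for d in docs for key in d})
materialized = [[d.get(key) for key in all_keys] for d in docs]
null_count = sum(cell is None for row in materialized for cell in row)
```

    With only three documents and two optional keys, the materialized table already holds four nulls out of twelve cells, which is the sparsity problem described above at toy scale.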
  • The UDA architecture allows us to identify major subsystems and in this case actual hardware platforms performing the processing.
  • Adding MongoDB to the QueryGrid is the vision we are working towards.
    The unique characteristics of each specialized engine are brought to bear on the IDW work.
  • MongoDB builds its scale-out architecture using Shards. These are similar to the concept of AMPs in Teradata or Vworkers in Aster. Data is hashed across the MongoDB cluster and stored in a primary shard. It is also replicated to a secondary shard on another node to enable recovery should the primary shard be unavailable.
    Connectivity to shards is actually done through the query routers, which send requests to the correct cluster node based on hashed keys. It's drawn this way for simplicity.
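    The idea of routing by hashed key can be sketched as a toy in Python. This is not MongoDB's actual implementation or API: the MD5-modulo placement, the shard_for function and the QueryRouter class are assumptions for illustration, and replication to a secondary is left out.

```python
import hashlib

def shard_for(key, num_shards):
    """Pick a shard index from a stable hash of the key (toy scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

class QueryRouter:
    """Hypothetical router: clients talk to it, never to shards directly."""

    def __init__(self, num_shards):
        # Each shard is modeled as a plain in-memory dictionary.
        self.shards = [dict() for _ in range(num_shards)]

    def insert(self, key, doc):
        self.shards[shard_for(key, len(self.shards))][key] = doc

    def find(self, key):
        # Only the shard that owns the hashed key is consulted.
        return self.shards[shard_for(key, len(self.shards))].get(key)

router = QueryRouter(num_shards=4)
for n in range(100):
    router.insert("customer-%d" % n, {"customer": n})
```

    Because the hash is deterministic, a lookup touches exactly one shard, which is what lets the cluster scale out by adding nodes.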
  • Note: click for animations
    A table operator request is submitted to PE
    PE launches contract function via the EAH
    EAH opens JDBC to Query Router
  • Note: click for animations
    EAH requests table metadata for specified table
    Metadata also includes ??? information
    PE & dispatcher distribute the output row format to all AMPs

  • Note: click for animations
    Each AMP is mapped to a series of Shards
    AMP connects to its corresponding Shard via the EAH
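    How AMPs are paired with shards is not spelled out in these notes. As one possible scheme, assumed here purely for illustration, shards could be dealt out to AMPs round-robin so that every shard is read by exactly one AMP. The function name map_amps_to_shards is hypothetical.

```python
def map_amps_to_shards(num_amps, shard_names):
    """Assign each shard to one AMP, round-robin (an assumed policy)."""
    assignment = {amp: [] for amp in range(num_amps)}
    for position, shard in enumerate(shard_names):
        assignment[position % num_amps].append(shard)
    return assignment

# Eight shards spread over four AMPs: each AMP gets a series of two shards.
shards = ["Shard %d" % n for n in range(1, 9)]
plan = map_amps_to_shards(4, shards)
```

    With more AMPs than shards, some AMPs would simply receive an empty list under this policy.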

  • Note: click for animations
    Each AMP reads rows of data from a shard and spools the reformatted row into Teradata spool
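    The reformatting step can be pictured as turning each JSON document into a fixed-order row that matches the output row format agreed during the contract phase. The sketch below assumes a hypothetical three-column format and uses a plain list to stand in for spool. It is not the connector's real code.

```python
import json

# Hypothetical output row format distributed to the AMPs beforehand.
OUTPUT_COLUMNS = ["order_id", "customer", "total"]

def to_row(document, columns):
    """Project a document onto the column list; absent fields become None."""
    return tuple(document.get(column) for column in columns)

def read_shard_into_spool(shard_docs, columns):
    """Parse each document read from a shard and append its row to spool."""
    spool = []
    for text in shard_docs:
        spool.append(to_row(json.loads(text), columns))
    return spool

# Made-up documents as they might arrive from one shard.
shard_docs = [
    '{"order_id": 1001, "customer": "C-17", "total": 59.90, "notes": "gift"}',
    '{"order_id": 1002, "customer": "C-42"}',
]
spool = read_shard_into_spool(shard_docs, OUTPUT_COLUMNS)
```

    Fields outside the row format are dropped and missing fields come through as nulls, so every spooled row has the same shape regardless of how the source documents vary.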
  • This is an existing Teradata customer who has evolved into using MongoDB for their eCommerce website. Formerly a mail order company, they have become a full eTailer. On a nightly basis, they extract data from MongoDB and load it into the data warehouse. They use deep dive predictive analytics, buyer preferences, promotional objectives, and other data to provide context and next-best-offers to the MongoDB application. Once calculated, the new information is exported to files and loaded into the MongoDB shards to make the website visitor experience more relevant and, hopefully, to bring more sales with it.
  • The major source of rich customer information is the data warehouse. For years, DWs have collected customer purchases, payment history, buyer preferences, claims, plus next best offers and upsell opportunities. A lot of this data is historical, going back 3-5 years. And some of it is the result of predictive analytics coupled with campaign management tools.
    Real time tactical access to the data warehouse is the same as accessing any relational database. We call this Active Data Warehousing. Hundreds of Teradata customers are accessing data in near real time with their Active Data Warehouse.
    Combining these rich subject areas with MongoDB JSON data helps provide a faster time to resolution, next best offers, and the correct customer treatments based on their status with the corporation.
  • One of the key IoT concepts is the development of intelligent, connected “edge” devices. One example of such an IoT device is the Bosch Rexroth Nexo, an industrial nut runner wrench which is equipped with an on-board computer and wireless connectivity. The on-board computer supports many aspects of the tightening process, from configuration (e.g. which torque to use) to creating a protocol of the work completed (e.g. which torque was actually measured).
    In addition, the Nexo features a laser scanner for component identification. By integrating such an intelligent edge device into the IoT, very powerful services can be developed that can help with supply chain optimization and modularizing the production line. For example, these intelligent tightening tools can now be managed by a central asset management application, which provides different services:
    • Basic services could include features like helping to actually locate the equipment in a large production facility
    • Geo-fencing concepts can be applied to help ensure only an approved tool with the right specification and configuration can be used on a specific product in a production cell.
    MongoDB collects sensor data like this and makes it available to Java applications for tracking. Passing the data to Teradata allows for deep trend analysis, maintenance planning, and other IoT analytics.
  • Top 5 Things to Know About Integrating MongoDB into Your Data Warehouse

    2. Copyright Teradata • Scale-out NoSQL + Scale-out DW • Data Warehouse = context • JSON in the Data Warehouse • Integration: Data Sharing • Use Cases
    3. What is a Teradata Data Warehouse? • Analytic database > In-memory, in-database • Scale-out MPP > 30+ petabyte sites > 35PB, 4096 cores • Self service BI > Dashboards, reports, OLAP > Predictive analytics • Complex SQL > 20-50 way joins > 350 pages of SQL • Real time access/load • Mixed workloads [Diagram labels: Data scientists, Power users, Sales, partners; 1024 nodes; Intel CPUs 512GB (repeated four times)]
    4. What is a Data Warehouse? Context [Diagram labels: Price history Inventory Supplier Contracts Product/Services Channels E-Commerce Labor Associate Customer Sales transactions Point of Sale Shipment Carrier Campaigns Promotion Warehouse]
    5. A Day at the Ticket Agency • 185 applications > Travel agents & corporate travel managers > Mobile: airline executives > Corporate travel managers and travel agents > Hoteliers • Teradata 5650 V13.10 > 25TB of data > 1000+ users • Mini-batch every 15 min • GoldenGate replication • Tactical queries 0.2 seconds • 14M queries/day [Chart: Availability by year, 2008 to 2011; plotted values 99.7, 99.78, 99.98, 99.94; axis from 99.4 to 100]
    6. Teradata in the Data Warehouse Market
    7. Forrester Data Warehouse Wave December 2013
    9. Late Binding in SQL [Diagram labels: Early binding, Late binding, Runtime, Load time, Data Warehouse, Source data, Schema, ETL, table, SQL + JSONPath, BI tools, JSON]
    10. JSONPath inside SQL
        SELECT box.MFG_Line.Product.Color AS "Color",
               box.MFG_Line.Product.Size AS "Size",
               box.MFG_Line.Product.Prod_ID AS "Prod_ID",
               box.MFG_Line.Product.Create_Time AS "Create_Time"
        FROM mfgTable
        WHERE CAST(box.MFG_Line.Product.Create_Time AS TIMESTAMP) >= TIMESTAMP'2013-06-16 00:00:00'
          AND box.MFG_Line.Product.Prod_ID = 96;

        Color Size  Prod_ID Create_Time
        ----- ----- ------- -------------------
        Blue  Small 96      2013-06-17 20:07:27
    11. Flexible: Schema-on-Read • JSON object  schema column > Treated like any column > Use any BI tool • Apply “schema” at runtime • Why not shred JSON into columns? > Urgency, agility > Bypass extensive change controls > Complex data – Bill of materials, etc.
    13. Teradata Unified Data Architecture: System Conceptual View [Diagram labels: SOURCES: ERP, SCM, CRM, Images, Audio and Video, Machine Logs, Text, Web and Social; ACCESS, MANAGE, MOVE; DATA PLATFORM, INTEGRATED DISCOVERY PLATFORM, INTEGRATED DATA WAREHOUSE; TERADATA DATABASE, HORTONWORKS, TERADATA ASTER DATABASE; ANALYTIC TOOLS & APPS: Math and Stats, Data Mining, Business Intelligence, Applications, Languages, Marketing; USERS: Marketing, Executives, Operational Systems, Frontline Workers, Customers, Partners, Engineers, Data Scientists, Business Analysts]
    14. Teradata and MongoDB: Next Steps [Diagram labels: Teradata Systems, TERADATA DATABASE; TERADATA ASTER DATABASE (SQL, SQL-MR, SQL-GR); OTHER DATABASES, Remote Data; HADOOP, Push-down to Hadoop; IDW, TERADATA DATABASE; Discovery, TERADATA ASTER DATABASE; Business users, Data Scientists; MONGODB NoSQL Database]
    15. INTEGRATION • Export / Import • Direct Connect
    16. Teradata and MongoDB • Operational + Analytical > Rich MongoDB applications > Rich Teradata analytics > Complementary • Teradata pulls directly from MongoDB sharded clusters • Teradata pushes back to MongoDB deployments [Diagram labels: MongoDB, Teradata, Application, Data, Analytics]
    17. Scale-out NoSQL + Scale-out DW [Diagram labels: SQL, Application; NoSQL: three query routers, Primary Shard 1, Primary Shard 2, Primary Shard 3, Primary Shard N; SQL: four groups of PE with AMPs]
    18. Contract Phase [Diagram labels: SQL; Teradata node with PE, four AMPs and EAH; Query Router; Shard 1, Shard 2, Shard 3, Shard 4]
    19. Contract Phase [Diagram labels: Teradata node with PE, four AMPs and EAH; Query Router; Shard 1, Shard 2, Shard 3, Shard 4]
    20. Data Export to Shards [Diagram labels: Teradata node with PE, four AMPs and EAH; Query Router; Shard 1, Shard 2, Shard 3, Shard 4]
    21. Import Data from Shards [Diagram labels: Teradata node with PE, four AMPs and EAH; Query Router; Shard 1, Shard 2, Shard 3, Shard 4]
    23. eCommerce in Action: A Virtuous Circle [Diagram labels: Buyer preferences, Sales catalog, Campaigns, Recent purchases, Profitability; Data Warehouse; eight Shards]
    24. Call Center Efficiency: A Virtuous Circle [Diagram labels: Trouble tickets, Customer profiles, Payment history, Claims, Next best offer, web logs; Data Warehouse; eight Shards]
    25. Internet of Things: Making Sense of Sensors [Diagram labels: Condition-based maintenance, R&D testing, Yield management, Warranty mgmt.; Data Warehouse; eight Shards]
    26. Conclusions • Two scale out architectures > OLTP scale-out > Analytics scale-out • JSON in the data warehouse • Context from the DW > Enriching MongoDB applications • Integration > Import/export > Teradata QueryGrid