Hadoop is not an Island in the Enterprise


Published in: Technology, Business

  1. HADOOP IS NOT AN ISLAND IN THE ENTERPRISE: Understanding Deployment Practices that Merge the Strengths of Hadoop and the Data Warehouse
     Joe Rao, PS Consultant, Teradata Corporation
  2. AGENDA
     6/17/2014 Teradata Confidential
     This presentation covers:
     • A comparison of the strengths of Hadoop and a data warehouse
     • Architectures in which Hadoop and the data warehouse work together
  3. FRAMING THE DISCUSSION
     • Our two platforms:
       > The data platform: Hadoop
       > The enterprise data warehouse: Teradata
     • Both platforms could handle everything by themselves if we really wanted them to
     • Biased organizations will favor one over the other, and argue that everything can be done in one place
     • And they're both right
  4. FRAMING THE DISCUSSION
     • Let's consider a software startup or company that has no IT department yet
     • They need to:
       > Acquire their technology from scratch
       > Build business logic from scratch
       > Staff their new department from scratch
     • With no existing technology, how should they structure their data center?
  5. DATA WAREHOUSE STRENGTHS
     • Traditional data warehouses (like the Teradata database) have been used as the central repository of business data for years.
     • Data warehouses are great with:
       > Thousands of concurrent users and queries
       > Full ANSI SQL interfaces
       > Very complex SQL query logic
       > Advanced workload management
       > Transactional capabilities
       > Secure access
  6. DATA WAREHOUSE ONLY? I’m lonely
     • Many companies that have been doing things the old way with a data warehouse don't think they need to change anything.
     • What they've been doing has worked for years. Hadoop is young and immature, they say. Why change?
     • These companies are change-resistant. They are missing out on the advancements in big data and can fall behind their competition.
  7. HADOOP STRENGTHS
     • Hadoop is changing the game in the enterprise data landscape. Its major strengths include:
       > Economical
       > Able to process extremely large data sets
       > Extremely flexible storage and processing
       > Open, free, active community development
  8. TERADATA APPLIANCE FOR HADOOP
     • Appliance Solution
       > Purpose-built integrated hardware/software solution
       > Optimized hardware for Hadoop, software, storage, and networking in a single rack
       > Delivered ready to run at a competitive price point
     • Enterprise Ready
       > 100% open-source Hadoop via Hortonworks HDP
       > Integrated with Teradata Unified Data Architecture on 40GB/s InfiniBand BYNET V5 for performance and reliability
       > Support for major ETL tools, enhanced security, and metadata management
       > Management tools for monitoring system health
     • Benefits
       > Lowest TCO and fastest time to value
       > Fully engineered and supported by Teradata
  9. HADOOP ONLY? I’m lonely
     • Many companies are so eager to jump onto the Hadoop wave that they think they can run their entire data center on Hadoop.
     • It's free, it has lots of development effort behind it, and it's flexible. Why go the “old way” with an EDW?
     • These companies are using Hadoop beyond its design and maturity level, and may run into technical problems meeting requirements.
  10. CONCLUSIONS: TWO TCOD EXAMPLES
      1. TCOD is NOT platform cost; it is total project cost
      2. Each technology has large advantages in its sweet spot(s)
      3. Neither platform is cost effective in the other’s sweet spot
      4. Biggest differences for the data warehouse are the development of:
         > Complex queries
         > Analytics
      Source: WinterCorp - Full report at
      [Charts: two examples comparing cost in millions of dollars "On Hadoop" vs. "On Data Warehouse". Data refining: Hadoop wins (also: landing zone, archive). EDW: data warehouse platform wins. Legend: Total System Cost, System and Data Admin, Application Development, ETL, Complex Queries, Analysis.]
  11. EDW VS. HADOOP
      • These two platforms are complementary!
      • Successful enterprise data centers merge the strengths of both platforms.
  12. COMBINED ARCHITECTURES
      • Split Workload Architecture
      • ETL System Architecture
      • Secure Access Architecture
      • Active Archive Architecture
  13. INSURANCE USE CASE
      Situation: A large, diversified customer needed to accurately calculate scores and adjust risk premiums for its enterprise fleets based on vehicle data, driver behavior, GPS data, weather data, traffic, and DW data. Its current custom-developed applications limit the effectiveness of these scores.
      Problem: The customer lacks the infrastructure and systems to handle the huge volumes of real-time data. It has no ad hoc reporting systems to combine, enrich, and analyze the data. Limited storage capacity restricts the amount of data that can be captured, refined, and stored.
      Solution: Used the Teradata Big Analytics Appliance to design a platform that streamlines ingestion of telematics data of varying source, type, structure, and frequency, and combines it with other data sources to perform meaningful analytics.
      Impact:
      • Quickly analyze data for informed decisions and ad hoc reporting
      • Streamlined process to calculate vehicle and fleet scores
      • Cost-effectively quantify, adjust, and manage risk premiums
  14. SPLIT WORKLOADS
      • The data warehouse and Hadoop run different workloads on different data sets.
      [Diagram: Hadoop (big data) alongside the Teradata Integrated Data Warehouse (operational data)]
  15. SPLIT WORKLOADS
      • It is not economical to put gigantic, “value sparse” data sets on an enterprise data warehouse.
      • Hadoop was not built to be an accessible, highly concurrent transactional database.
      • The most natural architecture is to split the two platforms by data set and workload.
        > Teradata handles the operational business data and queries
        > Hadoop handles the “big data” sets that would be cost-prohibitive on the warehouse, such as web, machine, and social data
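
The split described on this slide can be read as a simple placement rule. The sketch below is only an illustration of that rule, not something from the deck: the `Workload` fields, the `choose_platform` function, and the order in which the conditions are checked are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a data set and how it is used."""
    name: str
    value_sparse: bool      # very large raw data with little value per byte
    high_concurrency: bool  # many simultaneous users or queries
    transactional: bool     # needs transactional behavior


def choose_platform(w: Workload) -> str:
    """Return "Teradata" or "Hadoop" following the slide's split.

    Assumption: concurrency and transactional needs are checked first,
    because the slide says Hadoop was not built for them.
    """
    if w.transactional or w.high_concurrency:
        return "Teradata"
    if w.value_sparse:
        return "Hadoop"
    # Assumption: remaining operational business data stays on the warehouse.
    return "Teradata"


examples = [
    Workload("web clickstream logs", value_sparse=True,
             high_concurrency=False, transactional=False),
    Workload("order entry", value_sparse=False,
             high_concurrency=True, transactional=True),
]
for w in examples:
    print(w.name, "->", choose_platform(w))
```

A real placement decision would also weigh cost, data volume, and existing skills; the rule above only mirrors the two bullets on the slide.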
  16. SPLIT WORKLOADS: BUSINESS VALUE
      • Both systems operate favorably on cost and performance with respect to their given workloads.
      • The business can analyze new data and gain new insights that their existing platform couldn't handle before.
  17. LARGE COMPUTER MANUFACTURER: Analysis of Customer Web Interactions (Capture, Refine, Store Clickstream Data)
      Situation: Customers interact with the public websites of a large PC vendor for various purposes, resulting in huge volumes of raw Omniture data. By its nature, the data's structure and format are not always consistent, and its volume makes processing difficult.
      Problem: Inconsistencies in the raw Omniture data, such as file errors and corrupted file compression, make capture and analysis error-prone. The volume and velocity (70 files/hr, 1M files) add to the complexity.
      Solution: A Teradata Big Analytics solution provides a landing and staging area for incoming data at high velocity. Hadoop nodes curate the data, check for data consistency, and prepare the data for consumption by higher-end analytic platforms.
      Impact:
      • Reduced data inconsistencies and improved performance
      • Capture and curate ALL the data and prepare it for analysis
      • Perform ad hoc analytics on multi-level interactions
      • Improved marketing campaigns and customer support process
  18. ETL SYSTEM ARCHITECTURE
      • Hadoop can be used as a staging and ETL preprocessing layer for the data warehouse.
      [Diagram: source data flows into Hadoop; transformed data flows from Hadoop to the Teradata platform family]
  19. ETL SYSTEM ARCHITECTURE
      • The data warehouse is busy with operational queries. We can reduce the workload on the DW by migrating some ETL to Hadoop.
      • ETL processing is a write-once step, which fits Hadoop's architecture.
      • Hadoop can inexpensively retain the raw source data for data lineage purposes.
      *Note that there are many cases where this migration doesn't make sense, such as when referential integrity checks are necessary. The DW is capable of handling its own ETL if necessary.
  20. TERADATA CONNECTOR FOR HADOOP (TDCH)
      • Command line interface for Hadoop / TD data transfer
      • Batch MapReduce jobs
      • Bidirectional
      • Runs on the Hadoop side
      [Diagram: TDCH between Hadoop and the Teradata platform family]
  21. TERADATA CONNECTOR FOR HADOOP
        hadoop jar /home/jo845b/teradata-connector-1.0.10/lib/teradata-connector-1.0.10.jar \
          com.teradata.hadoop.tool.TeradataExportTool \
          -libjars $LIB_JARS \
          -classname com.teradata.jdbc.TeraDriver \
          -url jdbc:teradata:// \
          -username jo845b \
          -password Teradata14 \
          -jobtype hcat \
          -fileformat rcfile \
          -method internal.fastload \
          -sourcedatabase default \
          -sourcetable ontime_sqoop \
          -targettable ontime_sqoop \
          -usexviews true
      • There is a plethora of options for fine-tuning data transfer between Teradata and Hadoop.
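
A command with this many options is easier to keep correct when it is assembled programmatically. The following is a minimal Python sketch, assuming only the tool class and flags shown on the slide. The helper names `build_tdch_export_args` and `redacted`, the `TDCH_PASSWORD` environment variable, and the relative jar path are hypothetical choices for illustration, and the JDBC host is left elided as it is on the slide.

```python
import os
import shlex


def build_tdch_export_args(jar_path, jdbc_url, username, password,
                           source_table, target_table, lib_jars):
    """Assemble the argument list for a TDCH export run.

    Only flags that appear on the slide are used here.
    """
    return [
        "hadoop", "jar", jar_path,
        "com.teradata.hadoop.tool.TeradataExportTool",
        "-libjars", lib_jars,
        "-classname", "com.teradata.jdbc.TeraDriver",
        "-url", jdbc_url,
        "-username", username,
        "-password", password,
        "-jobtype", "hcat",
        "-fileformat", "rcfile",
        "-method", "internal.fastload",
        "-sourcedatabase", "default",
        "-sourcetable", source_table,
        "-targettable", target_table,
        "-usexviews", "true",
    ]


def redacted(args):
    """Return a printable command line with the password value masked."""
    shown = list(args)
    shown[shown.index("-password") + 1] = "****"
    return " ".join(shlex.quote(a) for a in shown)


args = build_tdch_export_args(
    jar_path="teradata-connector-1.0.10.jar",       # hypothetical relative path
    jdbc_url="jdbc:teradata://",                    # host elided, as on the slide
    username="jo845b",
    password=os.environ.get("TDCH_PASSWORD", ""),   # hypothetical variable name
    source_table="ontime_sqoop",
    target_table="ontime_sqoop",
    lib_jars=os.environ.get("LIB_JARS", ""),
)
print(redacted(args))
```

On a Hadoop edge node, the list could be handed to `subprocess.run(args, check=True)`. Reading the password from the environment keeps it out of scripts and shell history, which the literal command on the slide does not do.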
  22. ETL SYSTEM ARCHITECTURE: BUSINESS VALUE
      • Hadoop frees up the data warehouse's limited storage and processing resources, saving the business time and money.
      • Data can now be kept in its raw form, adding new data lineage capabilities to the data center.
  23. BANKING USE CASE
      Situation: A large national bank needed to securely and inexpensively store and analyze raw financial data in varied nonrelational formats. The data needs strict access privileges and should be generally accessible to SQL users in some way.
      Problem: The current infrastructure is not flexible enough to handle the expected variations in data formats and processing algorithms. Security requirements are too strict for vanilla Hadoop.
      Solution: Use the Teradata Big Analytics Appliance to ingest and store the data. Analysts access the data through an access layer in the data warehouse, and power users manipulate the data on the Hadoop system directly.
      Impact:
      • Analyze multi-structured data types
      • Keep data confidential to those with access rights
      • SQL users have easy access to big data sources
  24. SECURE ACCESS ARCHITECTURE
      • Teradata can be used as an access layer to the data stored in Hadoop.
      [Diagram: Hadoop and the Teradata platform family, labeled "Queries", "Sub-queries", and "Data"]
  25. SECURE ACCESS ARCHITECTURE
      • Data in Hadoop can be accessed by data warehouse users with no knowledge of the inner workings of Hadoop.
      • The full Teradata SQL library is now available to Hadoop users.
      • Teradata can be used as a secure gateway to limit the authentication gap in Hadoop without needing Kerberos.
  26. TERADATA QUERY GRID: TERADATA DATABASE TO HADOOP
      • Direct data transfer from the Hadoop Distributed File System
      • Hadoop data referenced in normal SQL queries
      • Transfers occur in a high-speed, parallel, scalable fashion
      • Data can be processed on the fly or stored long-term
      [Diagram: Query Grid moving data between Hadoop and the Teradata platform family]
  27. TERADATA QUERY GRID
        CREATE VIEW TOM AS (
          SELECT *
          FROM load_from_hcatalog(
            USING
              server('')
              port('9083')
              username('hive')
              dbname('vim')
              templeton_port('1880')
          )
        );
      • There is a plethora of options for fine-tuning data transfer between Teradata and Hadoop.
      • Access rights on the view can limit users' access to other data sets.
  28. SECURE ACCESS ARCHITECTURE: BUSINESS VALUE
      • Businesses can leverage the much more widespread SQL and EDW user community instead of the small, expensive Hadoop expert community. This saves the business money.
      • Data can be stored inexpensively, securely, and accessibly at the same time.
  29. PHARMACY USE CASE
      Situation: High-performance storage is expensive. A large integrated pharmacy HC provider deals with a variety of data of differing business value. Not all of this data can be stored on the same system, and ever-expanding data volumes only add to the challenge.
      Problem: Data in long-term storage cannot be queried and takes a long time to retrieve. No analysis can be performed on the archived data, so the business loses out on the value of this data.
      Solution: Used Teradata Hadoop nodes to store all the data coming in from weblogs, medical data, and JSON files. Hadoop also serves as an enrichment layer to enhance data for high-end analytics consumption. The complete solution provides easy movement of data between Hadoop, Aster, and Teradata.
      Impact:
      • Reduced storage costs for data variety
      • Perform ad hoc analytics on multiple versions of the data
      • Retrieve data in minutes (vs. days with tape archives)
      • Reduced load and improved performance of the DW and databases
  30. ACTIVE ARCHIVE
      • Hadoop can be used to store the data warehouse's cold data, historical data, and regular backups.
      [Diagram: backups and historical data move from the Teradata platform family to Hadoop]
  31. ACTIVE ARCHIVE
      • Using Hadoop as an active archive allows database users to access cold or historical data on the fly, unlike tape archives.
      • Hadoop data can be accessed in the EDW using Teradata QueryGrid: Teradata-Hadoop.
      • The data is no longer stored in the data warehouse, freeing valuable space. Hadoop is a less expensive platform on which to store this data.
  32. ACTIVE ARCHIVE: BUSINESS VALUE
      • Moving cold data to Hadoop frees up storage space on the relatively expensive data warehouse, saving the business money.
      • Unlike with tape, businesses can still access and analyze their data on Hadoop. This saves time and effort.
  33. CONCLUDING REMARKS: PUSHING HADOOP FURTHER
      • A successful DW / Hadoop coexistence system will see varying uses of all four of these mechanisms concurrently.
      • Replacing existing infrastructures with Hadoop is not a feasible goal.
      • In order to get Hadoop's foot in the door with large established enterprises, we need to push Hadoop as an integrated solution in tandem with a DW.
  34. Q&A