The modern analytics architecture



Published in: Technology


  1. 1. The Modern Analytics Architecture: Making Big Data Useful. Joseph D’Antoni, Solutions Architect, Anexinet. May 7-9, 2014 | San Jose, CA
  2. 2. Please silence cell phones
  3. 3. Joey D’Antoni • Joey has over 15 years of experience with a wide variety of data platforms, in both Fortune 50 companies and smaller organizations • He is a frequent speaker on database administration, big data, and career management • He is the co-president of the Philadelphia SQL Server User’s Group • He wants you to make sure you can restore your data
  4. 4. Agenda • Data Warehouses: how did we get here? • Big Data: Hadoop and more • Modern Analytic Tools • Building Our New Architecture
  5. 5. Data Warehouses: A History • Data warehousing had its origins in the 1970s, when A.C. Nielsen provided clients with data marts • In 1988, Barry Devlin and Paul Murphy (IBM) published “An Architecture for a Business and Information System” • In 1996, Ralph Kimball published “The Data Warehouse Toolkit”, which showcased models for OLAP-style modelling
  6. 6. Data Warehouse Models: Dimensional Model • Star schema • Advantage is that the DW is easier to use • Facts and dimensions allow queries to perform faster • Loading and ETL become more complicated • Structure changes are very expensive
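The fact-and-dimension idea on this slide can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table names (`fact_sales`, `dim_product`), the columns, and the sample rows are hypothetical and do not come from the talk.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_key INTEGER PRIMARY KEY,
        category    TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
        amount      REAL NOT NULL
    );
    INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
    INSERT INTO fact_sales VALUES (1, 10.0), (1, 15.0), (2, 40.0);
""")

# A typical dimensional query: aggregate the facts, grouped by a dimension attribute.
sales_by_category = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d ON d.product_key = f.product_key
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()

print(sales_by_category)
```

The query needs a single join from the fact table to each dimension it uses, which is part of why a star schema is considered easy to query. The cost is on the loading side, where ETL has to resolve every fact row to the right dimension key.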
  7. 7. Data Warehouse Model: Normalization • Tables are grouped by subject area (consumer, finance, products) • Tables are linked by joins • Very easy to add information into the database • Queries are harder to write, and joins can be very expensive performance-wise
  8. 8. Data Warehousing Challenges • Data Quality • ETL • Performance and Scalability • Costs: Licensing and Hardware
  9. 9. Data Quality
  10. 10. Extract, Transform, Load (ETL) Process [Diagram labels: Some Database; Some Process Your Business Doesn’t Care About] Credit: Buck Woody, Microsoft
  11. 11. Performance and Scalability • Given the volume of data, DW queries can be very slow • We use techniques like data compression to make them faster • The bottleneck used to be CPU; now it tends to be storage
  12. 12. Costs • Data warehouses need large servers • Database systems are licensed by the size of the server (per core) • Data warehouses need a whole lot of fast storage • Large volumes of fast storage (SANs) are expensive
  13. 13. Traditional Solutions
  14. 14. Classic Data Analysis Data Warehouse & BI Solutions ETL …Uses Just a Subset
  15. 15. Common Technical Themes There are a lot of “big data” solutions, but most of them have a lot in common • Built-in HA/DR through multiple copies of the data • Designed for analytics processing more than OLTP • Derived from open source solutions • Designed around local storage and commodity hardware
  16. 16. Components Of Modern Architecture • Hadoop (and its ecosystem) • EDW • Analytics Engine • Visualization Engine
  17. 17. Big Data Workflow for Combined Data and Analytics [Figure: workflow stages Data, Acquire, Organize, Analyze, Decide] • Data types: Structured, Semi-Structured, Unstructured • Data sources: Master and Reference; Transactions; Machine Generated (Logs); Web; Text, Image, Audio, Video • Storage and processing: DBMS (OLTP); Files; NoSQL (Key Value Data Store); HDFS; ETL/ELT; Change Data Capture; Real-Time; Message-Based; Hadoop MR; ODS; Data Warehouse; Streaming (CEP Engine); In-Database Analytics • Analytics: Reporting and dashboards; Alerting and recommendations; EPM, Social Apps; Text analytics and search; Advanced analytics; Interactive discovery • Hardware: Big Data Cluster; High Speed Network; RDBMS Cluster; In-Memory Analytics. Source: Gartner, Credit Suisse, 8/12
  18. 18. Are We Leaving the RDBMS?
  19. 19. CPUs [Chart annotations: Hadoop Project Starts; Exadata Launched]
  20. 20. Costs: Big Data versus Data Warehouse [Chart: Hadoop and Data Warehouse Costs; categories Server, Storage, Licensing, Total; series Hadoop and Data Warehouse; axis from $0 to $350,000] • For the same cost you can build a 15-node Hadoop cluster • The Hadoop cluster would have 3840 GB of RAM versus the 1024 GB in the DW server
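The RAM comparison on this slide can be worked through in a few lines. The node count and both RAM totals come from the slide. The per-node figure is derived, and it assumes the cluster's RAM is spread evenly across nodes, which the slide does not state.

```python
# Figures quoted on the slide.
hadoop_nodes = 15
hadoop_ram_gb = 3840
dw_ram_gb = 1024

# Derived values (assumption: RAM is spread evenly across the Hadoop nodes).
ram_per_node_gb = hadoop_ram_gb / hadoop_nodes
ram_ratio = hadoop_ram_gb / dw_ram_gb

print(f"RAM per Hadoop node: {ram_per_node_gb:.0f} GB")
print(f"Cluster RAM relative to the DW server: {ram_ratio:.2f}x")
```

Under that assumption each node carries 256 GB, and the cluster as a whole has 3.75 times the memory of the single data warehouse server for the same spend.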
  21. 21. Enter the Yellow Elephant
  22. 22. Hadoop • Hadoop is the leading Big Data platform (ecosystem) • Created by Doug Cutting and Mike Cafarella and developed largely at Yahoo • Scales horizontally (2-socket x86 servers in massive clusters) • Uses big, slow, local storage • Extremely fault-tolerant • In a nutshell: it’s a distributed file system (3 copies of data in cluster) and a programming framework called MapReduce
  23. 23. Introducing Hadoop [Cluster diagram] • Host 1: Name Node • Host 2: Secondary Name Node • Host 3: Data Node • Host 4: Data Node • Host 5: Data Node • Host 6: Data Node
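To make the "3 copies of data in cluster" idea concrete, here is a deliberately simplified sketch of spreading block replicas across the four data nodes in the diagram. The round-robin rule and the function name `place_replicas` are assumptions made for illustration. Real HDFS uses its own placement policy, which this does not reproduce.

```python
def place_replicas(block_ids, data_nodes, replication=3):
    """Assign each block to `replication` distinct data nodes, round-robin.

    Hypothetical helper for illustration only; not the HDFS placement policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        # Consecutive offsets modulo the node count are always distinct nodes.
        placement[block] = [
            data_nodes[(i + r) % len(data_nodes)] for r in range(replication)
        ]
    return placement


data_nodes = ["host3", "host4", "host5", "host6"]
placement = place_replicas(["block1", "block19", "block105"], data_nodes)

# Simulate losing one data node and count the surviving copies of each block.
failed = "host4"
surviving = {
    block: [node for node in nodes if node != failed]
    for block, nodes in placement.items()
}

for block, nodes in placement.items():
    print(block, nodes, "-> after failure:", surviving[block])
```

Because no node holds more than one copy of a given block, losing any single data node still leaves at least two copies of every block. That redundancy is what lets the cluster run on commodity hardware.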
  24. 24. How MapReduce Works • Automatic parallelism • Fault tolerance
  25. 25. Map Phase. Input File: foo.log (HDFS Block 1, HDFS Block 19, HDFS Block 105) 1) Read splits into records 2) Run Map 3) Write and Sort Output • Split 1 (K:0 V…) → Map Task 1 → K:INFO V… • Split 2 (K:123 V…) → Map Task 2 → K:INFO V:1, K:WARN V:1 • Split 3 (K:332 V…, K:368 V…) → Map Task 3 → K:Debug V:1, K:INFO V:1
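The three steps on this slide (read records, run map, write sorted output) can be imitated in plain Python. This is a single-process sketch, not Hadoop code. The log line format, the sample lines, and the function names are assumptions. The final reduce step is not on the slide and is added only to show what the sorted map output feeds into.

```python
from collections import defaultdict


def map_log_line(line):
    """Map step: emit (log level, 1) for one record.

    Assumes the level is the first whitespace-separated token of the line.
    """
    level = line.split()[0]
    yield (level, 1)


def run_map_task(split):
    """Run the map function over every record in a split, then sort the output."""
    pairs = [pair for line in split for pair in map_log_line(line)]
    return sorted(pairs)


def reduce_counts(map_outputs):
    """Reduce step (not on the slide): sum the values emitted for each key."""
    totals = defaultdict(int)
    for output in map_outputs:
        for key, value in output:
            totals[key] += value
    return dict(totals)


# Three hypothetical splits of foo.log, one per map task.
splits = [
    ["INFO service started"],
    ["INFO request received", "WARN slow response"],
    ["DEBUG cache miss", "INFO request received"],
]

map_outputs = [run_map_task(split) for split in splits]
totals = reduce_counts(map_outputs)
print(map_outputs)
print(totals)
```

Each call to `run_map_task` depends only on its own split. That independence is what allows a framework to run the tasks in parallel on different nodes and to rerun a single task if a node fails.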
  26. 26. Hadoop Ecosystem • HDFS • MapReduce Note: This is only a subset of the ecosystem!
  27. 27. YARN
  28. 28. Spark and Shark • Hadoop 2 Enhancements • Spark is in-memory • Shark integrates Spark with Hive
  29. 29. Hadoop Architectural Decisions • Distribution • Components • Support • Cloud vs On-Premises
  30. 30. Choosing Your Hadoop Distribution
  31. 31. Hadoop Vendors: Hadoop Distributions (Vendor: Description) • Apache: Completely open source software for distributed clusters and map/reduce • Cloudera: Industry-leading commercial distribution, good management tools • Hortonworks: Open source distribution, Apache compatible • MapR: Multiple enhancements to Apache Hadoop (rewrite of HDFS), high performance, enterprise ready • Pivotal HD: EMC spinoff with strong financial backing; this is a full high-performance RDBMS (with BI connectors) on top of Hadoop
  32. 32. Cloud vs On-Premises. Cloud: • Short-term use • Rapid scale • Test use cases • Pay as you go • Internet data source. On-Premises: • Large long-term implementations • Well-known workloads • Shared clusters • Large initial investment
  33. 33. Analytics Engine
  34. 34. Analytics • Hadoop was not fast • Full scans of files • So how do we rapidly analyze data?
  35. 35. Columnar Databases • Microsoft SQL Server (2012 & 2014) • PDW • HP Vertica • HBase • ParAccel • InfiniDB • EMC Greenplum
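One reason column-oriented storage helps with analytic queries is that an aggregate over one column does not have to touch the others. The sketch below contrasts the two layouts using plain Python data structures. It illustrates the storage idea only, not how any product listed above is implemented, and the sample orders are made up.

```python
# Row-oriented layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "East", "amount": 10.0},
    {"order_id": 2, "region": "East", "amount": 15.0},
    {"order_id": 3, "region": "West", "amount": 40.0},
]

# Column-oriented layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Summing from rows visits every record, including fields the query ignores.
total_from_rows = sum(row["amount"] for row in rows)

# Summing from columns reads only the one list the query needs.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

Both layouts give the same answer. The difference is how much data has to be read to get it, which matters when the table is far larger than memory. Values within a single column also tend to repeat, which makes them easier to compress.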
  36. 36. In-Memory Databases • SQL Server 2014 • SAP HANA • Oracle TimesTen • VoltDB • Apache Spark
  37. 37. Analytics Tools Past and Present
  38. 38. Data Visualization
  39. 39. Tools for Data Visualization • Excel (Power View and Power Map) • Tableau • Qlik • Platfora • Pentaho
  40. 40. Bringing This All Together • Power Query (Excel) [Diagram labels: Some Database; Some Process Your Business Doesn’t Care About]
  41. 41. Q & A ?
  42. 42. Session Evaluations Submit by 5 pm Friday, May 9 to WIN prizes. Your feedback is important and valuable. Ways to access: • Go to passbac2014/evals • Download the PASS EVENT App from your App Store and search: PASS BAC 2014 • Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
  43. 43. Thank You for attending this session and the PASS Business Analytics Conference 2014