The modern analytics architecture
Published in Technology
Transcript

  • 1. The Modern Analytics Architecture: Making Big Data Useful • Joseph D’Antoni, Solutions Architect, Anexinet • May 7-9, 2014 | San Jose, CA
  • 2. Please silence cell phones
  • 3. Joey D’Antoni • Joey has over 15 years of experience with a wide variety of data platforms, in Fortune 50 companies as well as smaller organizations. • He is a frequent speaker on database administration, big data, and career management. • He is the co-president of the Philadelphia SQL Server User’s Group. • He wants you to make sure you can restore your data.
  • 4. Agenda • Data Warehouses: how did we get here? • Big Data: Hadoop and more • Modern Analytic Tools • Building Our New Architecture
  • 5. Data Warehouses: A History • Data warehousing had its origins in the 1970s, when A.C. Nielsen provided clients with data marts • In 1988, Bill Inmon (IBM) published “An Architecture for a Business Information System” • In 1996, Ralph Kimball published “The Data Warehouse Toolkit”, which showcased models for OLAP-style modelling
  • 6. Data Warehouse Models: Dimensional Model • Star Schema • Advantage is that the DW is easier to use • Facts and dimensions allow queries to perform faster • Loading and ETL become more complicated • Structure changes are very expensive
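To make the star schema bullet concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table names (`fact_sales`, `dim_product`, `dim_date`), the columns, and the sample rows are all hypothetical and are not taken from the talk. The sketch only shows the shape of a dimensional query: one fact table joined to small dimension tables, then grouped.

```python
import sqlite3

# Hypothetical star schema: one fact table plus two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date    (date_id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE fact_sales  (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Books'), (2, 'Games');
INSERT INTO dim_date    VALUES (10, 2013), (11, 2014);
INSERT INTO fact_sales  VALUES (1, 10, 20.0), (1, 11, 30.0), (2, 11, 50.0);
""")


def sales_by_category(year):
    """Return total sales per product category for one year."""
    rows = conn.execute(
        """
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_id = f.product_id
        JOIN dim_date    AS d ON d.date_id = f.date_id
        WHERE d.year = ?
        GROUP BY p.category
        ORDER BY p.category
        """,
        (year,),
    ).fetchall()
    return dict(rows)


totals_2014 = sales_by_category(2014)
print(totals_2014)
```

Every analytical question follows the same pattern (fact table in the middle, dimensions around it), which is one reading of why the slide calls this model easier to use. The cost shows up on the loading side, where incoming data has to be split into facts and dimensions first.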
  • 7. Data Warehouse Model: Normalization • Tables are grouped by subject area (consumer, finance, products) • Tables are linked by joins • Very easy to add information into the database • Queries are harder to write, and joins can be very expensive performance-wise
  • 8. Data Warehousing Challenges • Data Quality • ETL • Performance and Scalability • Costs: Licensing and Hardware
  • 9. Data Quality
  • 10. Extract, Transform, Load (ETL) Process [Diagram text as extracted: Some Database / Business Doesn’t Care About / Process / Your / Some] Credit: Buck Woody, Microsoft
  • 11. Performance and Scalability • Given the volume of data, DW queries can be very slow • We use techniques like data compression to make them faster • CPU was the older bottleneck; now it tends to be storage
  • 12. Costs • Data warehouses need large servers • Database systems are licensed by the size of the server (per core) • Data warehouses need a whole lot of fast storage • Large volumes of fast storage (SANs) are expensive
  • 13. Traditional Solutions
  • 14. Classic Data Analysis • Data Warehouse & BI Solutions • ETL • …Uses Just a Subset
  • 15. Common Technical Themes • There are a lot of “big data” solutions, but most of them have a lot in common • Built-in HA/DR through multiple copies of the data • Designed for analytics processing more than OLTP • Derived from open source solutions • Designed around local storage and commodity hardware
  • 16. Components of Modern Architecture • Hadoop (and its ecosystem) • EDW • Analytics Engine • Visualization Engine
  • 17. Big Data Workflow for Combined Data and Analytics • Stages: Data, Acquire, Organize, Analyze, Decide • Data types: Structured, Semi-Structured, Unstructured • Data sources: Master and Reference; Transactions; Machine Generated (Logs); Web Text, Image, Audio, Video • Components: DBMS (OLTP); Files; NoSQL (Key Value Data Store); HDFS; ETL/ELT; Change Data Capture; Real-Time Message-Based; Hadoop MR; ODS; Data Warehouse; Streaming (CEP Engine); In-Database Analytics • Analytics: Reporting and dashboards; Alerting and recommendations; EPM, Social Apps; Text analytics and search; Advanced analytics; Interactive discovery • Hardware: Big Data Cluster; High Speed Network; RDBMS Cluster; In-Memory Analytics • Source: Gartner, Credit Suisse, 8/12
  • 18. Are We Leaving the RDBMS?
  • 19. CPUs [Chart annotations: Hadoop Project Starts; Exadata Launched]
  • 20. Costs: Big Data versus Data Warehouse [Chart: “Hadoop and Data Warehouse Costs”; categories: Server, Storage, Licensing, Total; series: Hadoop, Data Warehouse; axis from $0 to $350,000] • For the same cost you can build a 15-node Hadoop cluster • The Hadoop cluster would have 3840 GB of RAM versus the 1024 GB in the DW server
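The RAM comparison on this slide can be checked with a few lines of arithmetic. The three input figures come from the slide. The per-node figure and the ratio are derived here and were not stated by the speaker.

```python
# Figures taken from the slide.
hadoop_nodes = 15
hadoop_total_ram_gb = 3840
dw_ram_gb = 1024

# Derived values (not stated on the slide).
ram_per_node_gb = hadoop_total_ram_gb / hadoop_nodes
ram_ratio = hadoop_total_ram_gb / dw_ram_gb

print(f"RAM per Hadoop node: {ram_per_node_gb:.0f} GB")
print(f"Cluster RAM relative to the DW server: {ram_ratio:.2f}x")
```

Under these figures each Hadoop node carries 256 GB, and the cluster as a whole has 3.75 times the memory of the single data warehouse server.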
  • 21. Enter the Yellow Elephant
  • 22. Hadoop • Hadoop is the leading Big Data platform (ecosystem) • Invented by Yahoo • Scales horizontally (2-socket x86 servers in massive clusters) • Uses big, slow, local storage • Extremely fault-tolerant • In a nutshell: it’s a distributed file system (3 copies of data in the cluster) and a programming framework called MapReduce
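The "3 copies of data" point has two practical consequences: usable capacity is roughly a third of raw capacity, and a block stays readable as long as one of its copies survives. The following is a toy sketch of both ideas. The host names, the 12 TB disk size, and the round-robin placement are assumptions made for illustration. Real HDFS placement is rack-aware and considerably more involved.

```python
def usable_capacity_tb(nodes, disk_tb_per_node, replication=3):
    """Raw capacity divided by the number of copies kept of each block."""
    return nodes * disk_tb_per_node / replication


def place_blocks(num_blocks, nodes, replication=3):
    """Toy round-robin placement: each block lands on `replication` distinct nodes."""
    return {
        block: {nodes[(block + i) % len(nodes)] for i in range(replication)}
        for block in range(num_blocks)
    }


def readable_after_failure(placement, failed_nodes):
    """True if every block still has at least one replica on a surviving node."""
    return all(replicas - failed_nodes for replicas in placement.values())


# Hypothetical four-node cluster with 12 TB of disk per node.
data_nodes = ["host3", "host4", "host5", "host6"]
placement = place_blocks(num_blocks=8, nodes=data_nodes)

print(usable_capacity_tb(nodes=4, disk_tb_per_node=12))
print(readable_after_failure(placement, {"host3", "host4"}))
print(readable_after_failure(placement, {"host3", "host4", "host5"}))
```

With three copies, any two nodes in this toy cluster can fail without data loss. Losing three of the four nodes removes every copy of at least one block.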
  • 23. Introducing Hadoop • Host 1: Name Node • Host 2: Secondary Name Node • Host 3: Data Node • Host 4: Data Node • Host 5: Data Node • Host 6: Data Node
  • 24. How Map Reduce Works • Automatic parallelism • Fault tolerance
  • 25. Map Phase • Input file: foo.log (HDFS Block 1, HDFS Block 19, HDFS Block 105) • 1) Read splits into records: Split 1 (K:0 V…); Split 2 (K:123 V…); Split 3 (K:332 V…, K:368 V…) • 2) Run Map: Map Task 1, Map Task 2, Map Task 3 • 3) Write and sort output: Map Task 1 (K:INFO V…); Map Task 2 (K:INFO V:1, K:WARN V:1); Map Task 3 (K:Debug V:1, K:INFO V:1)
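The three numbered steps of the map phase can be simulated in plain Python, without a Hadoop cluster. This is a sketch under stated assumptions. The log line format (level in the third whitespace-separated field) and the sample lines are invented. Offsets are counted within each split rather than across the whole file. The final reduce step is not on this slide and is added only to show where the sorted map output goes.

```python
from collections import defaultdict

# Hypothetical contents of foo.log, already divided into three splits.
SPLITS = [
    ["2014-05-07 10:00:01 INFO service started"],
    ["2014-05-07 10:00:02 INFO request received",
     "2014-05-07 10:00:03 WARN slow response"],
    ["2014-05-07 10:00:04 DEBUG cache miss",
     "2014-05-07 10:00:05 INFO request done"],
]


def read_records(split):
    """Step 1: turn a split into (offset, line) records."""
    offset = 0
    for line in split:
        yield offset, line
        offset += len(line) + 1  # +1 for the newline character


def map_record(offset, line):
    """Step 2: emit (log_level, 1) for each record; the offset key is unused."""
    fields = line.split()
    if len(fields) >= 3:
        yield fields[2], 1


def run_map_task(split):
    """Steps 2 and 3: run the mapper over one split and sort its output by key."""
    pairs = [pair
             for offset, line in read_records(split)
             for pair in map_record(offset, line)]
    return sorted(pairs)


def reduce_counts(map_outputs):
    """Added for illustration: sum the values for each key across all map tasks."""
    totals = defaultdict(int)
    for output in map_outputs:
        for key, value in output:
            totals[key] += value
    return dict(totals)


map_outputs = [run_map_task(split) for split in SPLITS]
level_counts = reduce_counts(map_outputs)
print(map_outputs)
print(level_counts)
```

Each call to `run_map_task` depends only on its own split, which is what lets the framework run map tasks in parallel on different nodes and rerun a single task if a node fails.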
  • 26. Hadoop Ecosystem • HDFS • MapReduce • Note: This is only a subset of the ecosystem!
  • 27. YARN
  • 28. Spark and Shark • Hadoop 2 Enhancements • Spark is in-memory • Shark integrates Spark with Hive
  • 29. Hadoop Architectural Decisions • Distribution • Components • Support • Cloud vs On-Premises
  • 30. Choosing Your Hadoop Distribution
  • 31. Hadoop Vendors (Technology: Hadoop Distributions; Vendor: Description) • Apache: Completely open source software for distributed clusters and map/reduce • Cloudera: Industry-leading commercial distribution, good management tools • Hortonworks: Open source distribution, Apache compatible • MapR: Multiple enhancements to Apache Hadoop (rewrite of HDFS), high performance, enterprise ready • Pivotal HD: EMC spinoff with strong financial backing; this is a full high-performance RDBMS (with BI connectors) on top of Hadoop
  • 32. Cloud vs On-Premises • Cloud: Short-term use; Rapid scale; Test use cases; Pay as you go; Internet data source • On-Premises: Large long-term implementations; Well-known workloads; Shared clusters; Large initial investment
  • 33. Analytics Engine
  • 34. Analytics • Hadoop is (was) not fast • Full scans of files • So how do we rapidly analyze data?
  • 35. Columnar Databases • Microsoft SQL Server (2012 & 2014) • PDW • HP Vertica • HBase • ParAccel • InfiniDB • EMC Greenplum
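The slides do not explain why a columnar layout helps with the full-scan problem, so here is a small illustrative sketch. It stores the same hypothetical order data row by row and column by column, then counts how many values each layout touches to sum one field. The data, the field names, and the "values read" counter are all invented for the example. Real columnar engines add compression and vectorized execution on top of this basic idea.

```python
# Hypothetical order data stored row by row.
ROWS = [
    {"order_id": 1, "region": "East", "amount": 20.0},
    {"order_id": 2, "region": "West", "amount": 30.0},
    {"order_id": 3, "region": "East", "amount": 50.0},
    {"order_id": 4, "region": "West", "amount": 10.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per column."""
    return {name: [row[name] for row in rows] for name in rows[0]}


def sum_amount_row_store(rows):
    """Sum one field; a row store reads the whole row to reach it."""
    total, values_read = 0.0, 0
    for row in rows:
        values_read += len(row)
        total += row["amount"]
    return total, values_read


def sum_amount_column_store(columns):
    """Sum one field; a column store reads only that column."""
    column = columns["amount"]
    return sum(column), len(column)


row_total, row_values_read = sum_amount_row_store(ROWS)
col_total, col_values_read = sum_amount_column_store(to_columns(ROWS))
print(row_total, row_values_read)
print(col_total, col_values_read)
```

Both layouts return the same total, but the column layout touches one value per row instead of three. The gap widens as tables get wider, which is the usual argument for columnar storage in analytic workloads.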
  • 36. In-Memory Databases • SQL Server 2014 • SAP HANA • Oracle TimesTen • VoltDB • Apache Spark
  • 37. Analytics Tools Past and Present
  • 38. Data Visualization
  • 39. Tools for Data Visualization • Excel (Power View and Power Map) • Tableau • Qlik • Platfora • Pentaho
  • 40. Bringing This All Together • Power Query (Excel) [Diagram text as extracted: Some Database / Business Doesn’t Care About / Process / Your / Some]
  • 41. Q & A ?
  • 42. Session Evaluations • Submit by 5pm Friday, May 9 to WIN prizes • Your feedback is important and valuable. • Ways to access: Go to passbac2014/evals; Download the PASS EVENT App from your App Store and search: PASS BAC 2014; Follow the QR code link displayed on session signage throughout the conference venue and in the program guide
  • 43. Thank You for attending this session and the PASS Business Analytics Conference 2014