MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin

  1. BIG(GER) FASTER DATA
     Andrew Hood – Managing Director
     Cameron Gray – Data Engineer
  2. Things That Are Not New
     • Parallel Processing
     • Distributed Computing
     • Columnar Databases
     • Moore’s Law
     • Kryder’s Law
     21 September 2015 © Lynchpin Analytics Limited, All Rights Reserved
  3. Things That Are New(ish)
     • Cheap as Chips Cloud Computing
     • Mature(ish) Open Source Technologies
     • Standard Platforms (e.g. Redshift)
     • Attitudes to External Data Hosting
     • Time to Implementation
  4. Big Data: Structured Rant
     1. I really don’t care how big anyone’s data is.
     2. I do care about how long something takes.
     3. I do care about how much something costs.
     4. Faster+Cheaper could completely change what approach (tools/techniques) I might select for a given analytical task.
  5. Use Case 1: RDBMS
     • Historical Transactional Data → Load into Relational Database (e.g. PostgreSQL) → SQL Query & Transformation
     • Historical Transactional Data → Load into Hadoop + Hive → SQL Query & Transformation
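
The load-then-query flow on slide 5 can be sketched in a few lines. This is a minimal illustration only: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the `transactions` table, its columns and the sample rows are hypothetical, not taken from the talk.

```python
import sqlite3

# Hypothetical historical transactions: (order_id, customer_id, order_date, amount).
TRANSACTIONS = [
    (1, "c1", "2015-01-03", 20.0),
    (2, "c2", "2015-01-15", 35.5),
    (3, "c1", "2015-02-01", 12.5),
    (4, "c3", "2015-02-09", 50.0),
]


def load_transactions(conn, rows):
    """Load step: create the table and bulk-insert the historical rows."""
    conn.execute(
        "CREATE TABLE transactions ("
        "order_id INTEGER PRIMARY KEY, customer_id TEXT, "
        "order_date TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def monthly_revenue(conn):
    """Query and transformation step: total revenue per calendar month."""
    cursor = conn.execute(
        "SELECT substr(order_date, 1, 7) AS month, SUM(amount) AS revenue "
        "FROM transactions GROUP BY month ORDER BY month"
    )
    return cursor.fetchall()


# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
load_transactions(conn, TRANSACTIONS)
print(monthly_revenue(conn))
```

The same SQL-style aggregation is what the second flow would run through Hive. The point of the slide is that the query stays familiar while the storage and execution layer changes.
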
  6. Use Case 2: Crap Analytics Tool
     • Adobe Clickstream Log/Google BigQuery Export → Hadoop/Redshift/Impala → Tableau
  7. Data Processing Scenarios: Use a relational database
     • Using raw clickstream/log files/other data sources
     • Powerful querying capabilities (SQL)
     • Integrates well with other tools
     • Can handle large data sets (>1 million rows easily)
     • Likely requires dedicated server and administration skillset
  8. Data Processing Scenarios: Data readily available within analytics tool
     • Limited by analytics tool capabilities
     • Limited by aggregation and pre-processing definitions
     • Limited by sampling (based on date range, breakdowns, number of rows)
     • Limited by visualisation options (e.g. charting options)
  9. Data Processing Scenarios: Export data into external tool
     • Microsoft Excel, Tableau, R…
     • More control over analysis, reporting, visualisation
     • Still limited by the underlying data set
     • Tool limitations (e.g. 1 million rows in Excel)
     • Limited by PC resources
  10. When do RDBMSs stop working efficiently?
      • Sheer volume of data to process leads to problems
        – Limited by database server hardware
      • Database can’t keep up with amount of data being inserted
      • Queries have increasingly long processing times
        – Pre-computing queries also takes longer…
      • Change in reporting requirements means reprocessing large amounts of historical data
  11. What are the solutions?
      • Limit requirement definitions (i.e. say “not possible”… boo!)
      • Invest in very expensive server hardware
        – Gets very expensive, with diminishing returns
        – Single server means single point of failure
        – Having a backup/failover means needing another very expensive server!
      • Use multiple servers working together?
  12. Horizontal Scaling
      • The ability to spread data and processing across multiple servers in a cluster
      • As demand increases, just add more and more servers to the cluster
      • Cluster provides built-in redundancy: robust to failures of individual servers
      • Use technologies that scale effectively across the cluster
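
To make "spread data across multiple servers" concrete, here is a small sketch of hash partitioning, one common way to assign records to nodes. It is an illustration under stated assumptions, not a description of how Hadoop places data: the `visitor-N` keys and the function names are hypothetical, and replication and rebalancing are left out.

```python
import hashlib
from collections import Counter


def node_for_key(key, num_nodes):
    """Map a record key to a node index using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


def partition(keys, num_nodes):
    """Return how many records each node would hold."""
    counts = Counter(node_for_key(key, num_nodes) for key in keys)
    return [counts.get(node, 0) for node in range(num_nodes)]


# Hypothetical record keys; adding nodes reduces the share each one holds.
keys = [f"visitor-{i}" for i in range(10_000)]
for num_nodes in (1, 2, 5):
    print(num_nodes, partition(keys, num_nodes))
```

A stable hash (`hashlib`) is used rather than Python's built-in `hash()`, which is randomised per process for strings and so would assign the same key to different nodes on different runs.
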
  13. Technologies
      Lots of others!
  14. Hadoop
      • Distributed File Storage
      • Cluster Management
      • Distributed Processing
      • Client Applications
  15. How does it all fit together?
      • Raw Data → Import → Process, Aggregate, Compute Views → Export
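
The import, process/aggregate and export stages on slide 15 can be sketched as a single-machine imitation of the map, shuffle and reduce phases that Hadoop's distributed processing is built around. The log format, the sample lines and the function names below are hypothetical, and nothing here is actually distributed.

```python
from collections import defaultdict

# Hypothetical raw clickstream lines: date,page,visitor_id
RAW_LINES = [
    "2015-09-01,/home,visitor-1",
    "2015-09-01,/pricing,visitor-1",
    "2015-09-01,/home,visitor-2",
    "2015-09-02,/home,visitor-3",
    "2015-09-02,/contact,visitor-2",
]


def map_phase(line):
    """Import/parse one raw line and emit a (page, 1) pair."""
    _date, page, _visitor = line.split(",")
    yield page, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Aggregate all values for one key into a page-view count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle and reduce in sequence; return the exported view."""
    pairs = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


print(run_job(RAW_LINES))
```

On a cluster, the map and reduce calls would run in parallel on different nodes over different slices of the data, which is where the speed-up comes from.
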
  16. Some experimental results
      [Figure: two bar charts of Time (s) for the Standard method and Hadoop with 1 to 5 nodes. Test 1 – filter test (5GB); Test 2 – aggregation test (5GB)]
  17. Different Tools have Different Requirements
      • Some tools such as Impala process as much data as possible in memory
        – Requires lots of RAM
      • Some tools such as Hive process data mostly on disk
        – Requires high disk I/O
        – Either fast disks/SSDs or as many disks as possible
  18. Next steps to try out yourself!
      • You can try out processing with Hadoop using a cloud service like Amazon Web Services
      • Set up an account, create a few nodes, install Hadoop
      • Upload some test data – the larger the better
      • Try running some complex data processing on the data to get an idea of the performance
  19. Things to remember
      • Do testing before investing in new hardware / infrastructure
        – Test all tools you are interested in using with various amounts of RAM, CPU cores and I/O performance.
      • Sheer number of tools in Hadoop ecosystem – worth planning out what you need
  20. Thank-you
      www.lynchpin.com
