MeasureCamp 7 Bigger Faster Data by Andrew Hood and Cameron Gray from Lynchpin

New technologies in big data processing for statistical and behavioural analytics

1. BIG(GER) FASTER DATA
   Andrew Hood – Managing Director
   Cameron Gray – Data Engineer

2. Things That Are Not New
   • Parallel Processing
   • Distributed Computing
   • Columnar Databases
   • Moore’s Law
   • Kryder’s Law

3. Things That Are New(ish)
   • Cheap as Chips Cloud Computing
   • Mature(ish) Open Source Technologies
   • Standard Platforms (e.g. Redshift)
   • Attitudes to External Data Hosting
   • Time to Implementation

4. Big Data: Structured Rant
   1. I really don’t care how big anyone’s data is.
   2. I do care about how long something takes.
   3. I do care about how much something costs.
   4. Faster + cheaper could completely change what approach (tools/techniques) I might select for a given analytical task.

5. Use Case 1: RDBMS
   Historical transactional data → load into relational database (e.g. PostgreSQL) → SQL query & transformation
   Historical transactional data → load into Hadoop + Hive → SQL query & transformation (sketched below)

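   As a rough sketch of the “SQL query & transformation” step: Hive can expose raw files already sitting in HDFS as a table and then run essentially the same aggregation you would write against PostgreSQL. The transactions table and its columns below are illustrative, not from the deck:

   -- Hive: map raw transaction files in HDFS onto a table (schema is illustrative)
   CREATE EXTERNAL TABLE transactions (
       order_id    STRING,
       customer_id STRING,
       order_date  DATE,
       revenue     DOUBLE
   )
   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
   STORED AS TEXTFILE
   LOCATION '/data/transactions/';

   -- The same aggregation (give or take dialect details) runs in PostgreSQL or Hive
   SELECT order_date,
          COUNT(DISTINCT customer_id) AS customers,
          SUM(revenue)                AS total_revenue
   FROM transactions
   GROUP BY order_date;
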
6. Use Case 2: Crap Analytics Tool
   Adobe clickstream log / Google BigQuery export → Hadoop / Redshift / Impala → Tableau

7. Data Processing Scenarios: use a relational database
   • Using raw clickstream / log files / other data sources
   • Powerful querying capabilities (SQL) – see the sketch below
   • Integrates well with other tools
   • Can handle large data sets (>1 million rows easily)
   • Likely requires a dedicated server and administration skillset

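   To give a feel for the “powerful querying” point, here is the kind of daily roll-up that is awkward inside an analytics UI but only a few lines of SQL over a hit-level table; the clickstream_hits table and its columns are hypothetical:

   -- Daily visits, page views and pages per visit from raw hit-level data
   SELECT CAST(hit_timestamp AS DATE)                AS hit_date,
          COUNT(DISTINCT visit_id)                   AS visits,
          COUNT(*)                                   AS page_views,
          COUNT(*) * 1.0 / COUNT(DISTINCT visit_id)  AS pages_per_visit
   FROM clickstream_hits
   WHERE hit_type = 'page_view'
   GROUP BY CAST(hit_timestamp AS DATE)
   ORDER BY hit_date;
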
8. Data Processing Scenarios: data readily available within the analytics tool
   • Limited by analytics tool capabilities
   • Limited by aggregation and pre-processing definitions
   • Limited by sampling (based on date range, breakdowns, number of rows)
   • Limited by visualisation options (e.g. charting options)

9. Data Processing Scenarios: export data into an external tool
   • Microsoft Excel, Tableau, R…
   • More control over analysis, reporting, visualisation
   • Still limited by the underlying data set
   • Tool limitations (e.g. 1 million rows in Excel)
   • Limited by PC resources

10. When do RDBMSs stop working efficiently?
    • Sheer volume of data to process leads to problems
      – Limited by database server hardware
    • Database can’t keep up with the amount of data being inserted
    • Queries have increasingly long processing times
      – Pre-computing queries also takes longer… (see the sketch below)
    • Change in reporting requirements means reprocessing large amounts of historical data

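    As a concrete illustration of the pre-computation pain: in PostgreSQL a pre-computed summary might be a materialized view. Querying it stays cheap, but every refresh re-reads the full transaction history, so the nightly rebuild keeps getting slower as history grows. The table and view names here are hypothetical:

    -- Pre-computed daily summary (illustrative): cheap to query...
    CREATE MATERIALIZED VIEW daily_revenue AS
    SELECT order_date,
           COUNT(*)     AS orders,
           SUM(revenue) AS total_revenue
    FROM transactions
    GROUP BY order_date;

    -- ...but each refresh rescans all historical rows, so this step grows with the data
    REFRESH MATERIALIZED VIEW daily_revenue;
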
11. What are the solutions?
    • Limit requirement definitions (i.e. say “not possible”… boo!)
    • Invest in very expensive server hardware
      – Gets very expensive, with diminishing returns
      – Single server means single point of failure
      – Having a backup/failover means needing another very expensive server!
    • Use multiple servers working together?

12. Horizontal Scaling
    • The ability to spread data and processing across multiple servers in a cluster
    • As demand increases, just add more and more servers to the cluster
    • Cluster provides built-in redundancy: robust to failures of individual servers
    • Use technologies that scale effectively across the cluster

13. Technologies
    Lots of others!

14. Hadoop
    • Distributed File Storage
    • Cluster Management
    • Distributed Processing
    • Client Applications

15. How does it all fit together?
    Raw data → import → process, aggregate, compute views → export (see the sketch below)

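    A minimal HiveQL sketch of that import → process → export flow, assuming the raw files have already been copied into HDFS and that the raw_hits and daily_summary tables exist (all names and paths are illustrative):

    -- Import: move landed files into the raw_hits table's storage
    LOAD DATA INPATH '/landing/hits/' INTO TABLE raw_hits;

    -- Process / aggregate / compute views: build a much smaller daily summary
    INSERT OVERWRITE TABLE daily_summary
    SELECT CAST(hit_timestamp AS DATE) AS hit_date,
           COUNT(DISTINCT visit_id)    AS visits,
           COUNT(*)                    AS page_views
    FROM raw_hits
    GROUP BY CAST(hit_timestamp AS DATE);

    -- Export: write the summary out for Tableau/Excel/R to pick up
    INSERT OVERWRITE DIRECTORY '/export/daily_summary'
    SELECT * FROM daily_summary;
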
16. Some experimental results
    [Two bar charts comparing run time (seconds) for the standard method against Hadoop with 1–5 nodes.]
    Test 1 – filter test (5GB)
    Test 2 – aggregation test (5GB)

17. Different Tools have Different Requirements
    • Some tools, such as Impala, process as much data as possible in memory
      – Requires lots of RAM
    • Some tools, such as Hive, process data mostly on disk
      – Requires high disk I/O
      – Either fast disks/SSDs or as many disks as possible

18. Next steps to try out yourself!
    • You can try out processing with Hadoop using a cloud service like Amazon Web Services
    • Set up an account, create a few nodes, install Hadoop
    • Upload some test data – the larger the better
    • Try running some complex data processing on the data to get an idea of the performance

19. Things to remember
    • Do testing before investing in new hardware/infrastructure
      – Test all the tools you are interested in using with various amounts of RAM, CPU cores and I/O performance
    • The sheer number of tools in the Hadoop ecosystem makes it worth planning out what you need

20. Thank you
    www.lynchpin.com
    21 September 2015 © Lynchpin Analytics Limited, All Rights Reserved
