Presentation for the Data Science Research Group Workshop on 7 February 2017 at AUT. The talk centres on the problems in Big Data analytics, the tools for overcoming them, and the way the company Qrious uses these tools to build solutions.
Qrious about Insights -- Big Data in the Real World
AUT DSRG Workshop
Guy Kloss
guy.kloss@qrious.co.nz
Enterprise Architect
Qrious Limited
7 February 2017
2. The Problem Examples The Solution Tools of the Trade Boxing up a Solution Flotsam and Jetsam
Outline
1 The Problem
2 Examples
3 The Solution
4 Tools of the Trade
5 Boxing up a Solution
6 Flotsam and Jetsam
Guy Kloss | Big Data in the Real World 2/41
Who/What is Qrious?
We help New Zealand businesses
and public sector organisations
create value
and solve their most pressing business problems
by turning data into actionable insight.
Backed by Spark
Approx. 60 employees
Offices in Auckland & Wellington
Substantial investment across Data, Platform & People
Built from the ground up
(new generation technology and working principles)
One of the largest Data Science teams in the country
with > 80% qualified to Masters & PhD level
and over 60 years of combined experience
NZ's leading data analytics specialist by 2017
Our Capabilities
Advanced analytics
Location insights
Big Data platforms
Consulting services
BI & Warehousing
Who am I?
Chemical Engineer (Masters)
Rocket Scientist (German Aerospace Centre)
Computer Scientist (PhD)
Former lecturer (AUT)
Lead Software Developer and Head Crypto Geek @ Mega
Enterprise Architect at Qrious
Dad, baseballer, diver, . . . general geek!
1 The Problem
Data size
Number of records
Data volume
An exponentially growing data world
Primary Memory/Disk Capacity
Relative Speeds
Source: http://www.cs.cmu.edu/~amarp/cpu-io-gap
Size Does Matter!
Access/processing beyond a single machine
(RAM, disk, CPU)
Expensive data transfers at volume
(latency, throughput)
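As a rough illustration of why transfers at volume are expensive, the cost can be modelled as a per-request latency plus the payload size divided by throughput. The link speed, latency and data sizes below are assumed figures chosen for the arithmetic, not numbers from the talk.

```python
def transfer_time_seconds(total_bytes, throughput_bytes_per_s, latency_s, requests=1):
    """Simple cost model: every request pays the latency once,
    and all bytes together are limited by the throughput."""
    return requests * latency_s + total_bytes / throughput_bytes_per_s


TB = 10**12

# Assumption: a fully utilised 10 Gbit/s link with 1 ms latency per request.
link_bytes_per_s = 10 * 10**9 / 8

# One bulk transfer of 100 TB: throughput dominates.
bulk_seconds = transfer_time_seconds(100 * TB, link_bytes_per_s, latency_s=0.001)
bulk_hours = bulk_seconds / 3600

# One million small requests totalling only 1 GB: latency dominates.
chatty_seconds = transfer_time_seconds(10**9, link_bytes_per_s,
                                       latency_s=0.001, requests=10**6)
```

Under these assumptions the bulk transfer takes roughly 22 hours, while the chatty access pattern spends almost all of its time waiting on latency rather than moving data.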
Storage Issues
Storage, access, index, find
Transfer, manage, prevent data loss
Types of Data
Structured
Unstructured
Graphs
Free text
. . .
Correlating . . . co-relating . . . mashing . . .
Not a single-record problem
but an m:n problem
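To make the m:n point concrete, here is a minimal sketch comparing a naive join, which compares every record of one data set with every record of the other, against a hash join that indexes one side first. The record layout (`id`, `cell`) is hypothetical and only serves the illustration.

```python
from collections import defaultdict


def nested_loop_join(left, right, key):
    """Naive correlation: m * n comparisons for m left and n right records."""
    comparisons = 0
    matches = []
    for a in left:
        for b in right:
            comparisons += 1
            if a[key] == b[key]:
                matches.append((a, b))
    return matches, comparisons


def hash_join(left, right, key):
    """Index the left side by key, then probe once per right record."""
    index = defaultdict(list)
    for a in left:
        index[a[key]].append(a)
    return [(a, b) for b in right for a in index.get(b[key], [])]


# Hypothetical sample data.
left = [{"id": i, "cell": i % 3} for i in range(6)]
right = [{"id": j, "cell": j % 3} for j in range(4)]

naive_matches, comparisons = nested_loop_join(left, right, "cell")
hashed_matches = hash_join(left, right, "cell")
```

Both functions find the same pairs, but the naive version needs 6 x 4 = 24 comparisons here, a count that grows with the product of the two data set sizes.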
Beyond Exponential
Problems are between exponential and hyperexponential
→ Enabling data processing in an exponential world
2 Examples
Number of Records
> 1 trillion (10^9) records: Spark’s location-based data set
Anonymised for privacy (on ingest)
Fully encrypted (at rest and in transport)
Continuous/stream ingestion
Normalisation and segmentation of the data set
Correlating with external data sets
→ Finding insights in this “hay mountain”
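One common way to remove direct identifiers at ingest time is keyed hashing (HMAC). The sketch below is an assumption for illustration, not a description of how Qrious or Spark actually process their data; the field names `subscriber_id` and `subscriber` are hypothetical. Note that keyed hashing is strictly pseudonymisation: the same input always maps to the same token, so records stay linkable, and real anonymisation typically needs further measures.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed SHA-256 digest (HMAC)."""
    return hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


def ingest(records, secret_key):
    """Yield copies of the records with the raw identifier removed."""
    for record in records:
        cleaned = dict(record)
        raw_id = cleaned.pop("subscriber_id")  # hypothetical field name
        cleaned["subscriber"] = pseudonymise(raw_id, secret_key)
        yield cleaned
```

Because `ingest` is a generator, it can sit in front of a continuous stream and never needs the whole data set in memory.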
Data Volume
100s of TB to PB of “Data Lakes”
Not just a backup/data grave
Fully encrypted (at rest and in transport)
Includes data querying and processing capability
→ Capability to “store everything” (every thing and kind)
3 The Solution
Divide and Conquer
Massively parallel processing: MPP
Parallelise: Map-Reduce
Pipelines: Stream processing
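The Map-Reduce idea can be sketched in a few lines: split the input into independent chunks, process each chunk on its own (map), then combine the partial results (reduce). The word count below is the customary toy example and runs in a single process; the function names are illustrative only.

```python
from collections import Counter
from functools import reduce


def split_into_chunks(items, n_chunks):
    """Deal the items out into n_chunks independent work units."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def map_chunk(lines):
    """Map step: count words in one chunk without looking at other chunks."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def reduce_counts(left, right):
    """Reduce step: merge two partial results."""
    return left + right


def word_count(lines, n_chunks=4):
    partials = map(map_chunk, split_into_chunks(lines, n_chunks))
    return reduce(reduce_counts, partials, Counter())


result = word_count(["big data", "big clusters", "data data"], n_chunks=2)
```

Since `map_chunk` never touches shared state, the built-in `map` could be swapped for the `map` method of a `concurrent.futures.ProcessPoolExecutor` to spread the chunks over several CPU cores; a cluster framework applies the same pattern across machines.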
Leverage Data Locality
Bring processing to the data
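A minimal sketch of what "bringing processing to the data" means for a scheduler: each task is placed on a node that already stores the data block it needs, so no block has to cross the network. The block and node names are hypothetical, and real cluster schedulers weigh many more factors.

```python
def schedule(block_locations, node_load=None):
    """Assign each block's task to the least-loaded node holding that block.

    block_locations maps a block name to the set of nodes storing a replica.
    """
    load = dict(node_load or {})
    assignment = {}
    for block, nodes in sorted(block_locations.items()):
        # Only nodes with a local replica are candidates.
        chosen = min(sorted(nodes), key=lambda node: load.get(node, 0))
        assignment[block] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


# Hypothetical placement of three blocks, two replicas each.
block_locations = {
    "b1": {"n1", "n2"},
    "b2": {"n1", "n3"},
    "b3": {"n2", "n3"},
}
assignment = schedule(block_locations)
```

With this placement every task runs next to its data and the three tasks end up on three different nodes.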
The Right Tools
Don’t re-invent the wheel
Use existing high performing tools where possible
Available high productivity frameworks, making use of high level languages
The right tool for the type of data
Use the Source, Luke!
(Leverage open source based tooling with a community)
The Right Data Organisation
Row vs. columnar storage
→ For analytics often better in columnar format
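The difference between the two layouts can be shown with plain Python structures. In the row layout an aggregate over one field still walks every complete record; in the columnar layout it reads a single list. The sample records are made up for the illustration.

```python
# Row-oriented: one complete record after another.
rows = [
    {"id": 1, "region": "Auckland", "amount": 120.0},
    {"id": 2, "region": "Wellington", "amount": 80.5},
    {"id": 3, "region": "Auckland", "amount": 42.0},
]


def to_columns(rows):
    """Pivot row-oriented records into one list per field."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_row_store(rows, field):
    """Has to visit every record, although only one field is needed."""
    return sum(row[field] for row in rows)


def total_column_store(columns, field):
    """Reads exactly one column and nothing else."""
    return sum(columns[field])


columns = to_columns(rows)
```

On disk the same effect means an analytical query only has to read the columns it aggregates, and values of one type stored together tend to compress well.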
In, Out, Cha-Cha-Cha
Ingest data from (legacy, external) source systems
→ ETL – Extract, Transform, Load
Make sure the rhythm fits (no missing “Out”)
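A small ETL pipeline can be sketched with generators, so that records flow through extract, transform and load one at a time. The CSV content, column names and the `payments` table are hypothetical; an in-memory SQLite database stands in for the target system.

```python
import csv
import io
import sqlite3

# Hypothetical export from a source system, with typical blemishes.
SOURCE_CSV = """customer,amount,currency
alice, 10.50 ,NZD
bob,,NZD
carol,7.25,nzd
"""


def extract(text):
    """Extract: read records from the source format."""
    yield from csv.DictReader(io.StringIO(text))


def transform(records):
    """Transform: trim whitespace, drop incomplete records, normalise values."""
    for record in records:
        amount = record["amount"].strip()
        if not amount:
            continue  # no amount, nothing to load
        yield (record["customer"].strip(), float(amount),
               record["currency"].strip().upper())


def load(rows, connection):
    """Load: write the cleaned rows into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS payments (customer TEXT, amount REAL, currency TEXT)"
    )
    connection.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(SOURCE_CSV)), connection)
```

Querying the `payments` table afterwards is the "Out" part: data that has been loaded but cannot be read back in a useful form has not completed the dance.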
4 Tools of the Trade
Hadoop
Hadoop and distributions
Processing tools for relational, streaming, batch, graph, text, search, . . .
Allocates cluster resources dynamically
Data distributed (with redundancy),
so compute allocated where data is
Hadoop Distributions
Many Hadoop distributions: Similar to Linux distributions
Cloudera Partnership with Qrious
“Bronze” partner
Ambitions to become “Silver” partner
and MSP (managed service provider)
Basic Hadoop Tool Suite
Example: Cloudera Hadoop Distribution
MPP Databases
DB for massively parallel processing (MPP)
Greenplum database and forks
(based on PostgreSQL)
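The core idea of an MPP database can be sketched as follows: rows are spread over segments by hashing a distribution key, each segment aggregates its own rows, and a coordinator merges the partial results. This is a conceptual toy, not Greenplum's implementation; CRC32 is only a stand-in hash, and all names are illustrative.

```python
import zlib
from collections import defaultdict


def segment_for(key, n_segments):
    """Stable hash of the distribution key picks the segment."""
    return zlib.crc32(str(key).encode("utf-8")) % n_segments


def distribute(rows, key, n_segments):
    """Spread the rows over the segments by distribution key."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        segments[segment_for(row[key], n_segments)].append(row)
    return segments


def local_totals(segment_rows, group_key, value_key):
    """Work done independently on one segment."""
    totals = defaultdict(float)
    for row in segment_rows:
        totals[row[group_key]] += row[value_key]
    return dict(totals)


def mpp_group_sum(rows, group_key, value_key, n_segments=4):
    """Coordinator: fan out to the segments, then merge the partial sums."""
    result = {}
    for segment_rows in distribute(rows, group_key, n_segments):
        for group, subtotal in local_totals(segment_rows, group_key, value_key).items():
            result[group] = result.get(group, 0.0) + subtotal
    return result


# Hypothetical sample rows.
sales = [
    {"region": "Auckland", "amount": 120.0},
    {"region": "Wellington", "amount": 80.5},
    {"region": "Auckland", "amount": 42.0},
]
totals = mpp_group_sum(sales, "region", "amount")
```

Choosing the grouping column as distribution key keeps each group on a single segment, so the segments never need to exchange rows for this query.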
Generic and Specialised DBs
Generic RDBMS (where useful)
NoSQL
Graph DB
Other columnar species
5 Boxing up a Solution
Delivering a Suitable Solution
Includes:
System management
Connectivity
Application logic
Services
Yummy add-ons
System Management Framework
Security
Dedicated sub-networks with specific firewall rules
External firewalls
User and credentials management
Log collector
Other security tools . . .
System access
VPN
Remote desktop services
Connectivity
API gateways
(Reverse) proxies
SFTP
Application Logic
Platform-as-a-Service (PaaS)
Huge benefits of containerising application logic (using Docker)
→ Much shorter delivery cycles
APIs, Micro-Services
Orchestration of Big Data analysis
Services
Solutioning, build
Analytics and development
Operation and maintenance
Bonus Points for . . .
Provenance
(reproducibility, auditability, compliance)
AI and ML
Blockchain
(non-repudiation, trust, “smart contracts”,
identity management, federation, . . . )
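Provenance and blockchain share one basic building block: an append-only log in which every entry commits to the hash of its predecessor, so later tampering becomes detectable. The sketch below shows only that hash chain; it is an illustrative assumption and leaves out everything else a blockchain needs (consensus, signatures, distribution). The payload fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def entry_hash(previous_hash, payload):
    """Hash the payload together with the previous entry's hash."""
    body = json.dumps({"previous": previous_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log, payload):
    """Append a processing step to the provenance log."""
    previous_hash = log[-1]["hash"] if log else GENESIS
    entry = {
        "previous": previous_hash,
        "payload": payload,
        "hash": entry_hash(previous_hash, payload),
    }
    log.append(entry)
    return entry


def verify(log):
    """Recompute the chain; any edited or reordered entry breaks it."""
    previous_hash = GENESIS
    for entry in log:
        if entry["previous"] != previous_hash:
            return False
        if entry_hash(previous_hash, entry["payload"]) != entry["hash"]:
            return False
        previous_hash = entry["hash"]
    return True
```

An auditor who knows the hash of the latest entry can confirm that none of the recorded processing steps before it have been altered.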
6 Flotsam and Jetsam
In the Qrious Pipeline
Make Big Data a commodity: Don’t buy, pay for what you need!
→ Big-Data-Platform-as-a-Service (BDPaaS)
Sliced, diced and configured to your needs
Straight on bare metal,
not VMs (like most cloud hosters)
Maximising the Job Market
What skills do you need?
RDBMS?
SAS?
NoSQL DBs?
Maybe Hadoop is a good answer?
Questions?
Parallelise!
Guy Kloss
guy.kloss@qrious.co.nz
Just a humble hair-dryer from the 30s:
“One of the first machines used for permanent wave hairstyling back in the 1920’s and 1930’s.”
Dark Roasted Blend: http://www.darkroastedblend.com/2007/05/mystery-devices-issue-2.html