A presentation on how to deploy big data solutions, and on the difference between structured reporting systems, which feed business processes, and data science systems, which do cool stuff.
2. LEARN • NETWORK • COLLABORATE • INFLUENCE
Deploying Big Data platforms
Chris Kernaghan
Principal Consultant
Cholera epidemic: first use of big data
Big Data Epidemiology by Google
How I really got started in Big Data
John, we need to give Chris more grey hair
Let’s throw him into a Big Data demo
Areas of focus
Data acquisition and curation
Data storage
Compute infrastructure
Analysis and insight
Everything as Code*
* Well, as much as possible
Data Acquisition and curation
Areas of focus
How big was the Panama Papers data set?
Data Lake
Panama Papers Technology stack
SQL
The tools used supported 370 journalists from around the world
Infrastructure was a pool of up to 40 servers run in AWS
Data quality and curation are not one time activities
Remove the human element as much as possible
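One way to take the human out of routine curation is to encode the quality rules and run them on every load. The sketch below is illustrative only: the column names, the rules, and the sample data are assumptions, not part of the original platform.

```python
import csv
import io
from datetime import date


def is_iso_date(value: str) -> bool:
    """Return True if value parses as an ISO date such as 2016-04-03."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


# Hypothetical rule set: column name -> check that returns True for good values.
RULES = {
    "customer_id": lambda v: v.strip().isdigit(),
    "signup_date": is_iso_date,
}


def find_bad_rows(csv_text, rules):
    """Return (line_number, column, value) for every value that fails a rule."""
    failures = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        for column, check in rules.items():
            value = row.get(column) or ""
            if not check(value):
                failures.append((line_no, column, value))
    return failures


# Made-up sample data with two deliberate errors.
SAMPLE = (
    "customer_id,signup_date\n"
    "42,2016-04-03\n"
    "abc,2016-04-04\n"
    "43,03/04/2016\n"
)

for failure in find_bad_rows(SAMPLE, RULES):
    print(failure)
```

Because the rules live in code, the same checks run on every load and can be versioned along with the rest of the platform.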
Data security
• Data lake
– What data do you collect?
– Do you have restrictions on what data can be combined?
– How long does your data live?
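The "how long does your data live" question can be answered in code rather than by convention. This is a minimal sketch, assuming each data set in the lake is tagged with a category and an ingestion date. The categories, retention periods, and catalog entries are invented for illustration.

```python
from datetime import date, timedelta

# Hypothetical retention policy: data category -> maximum age in days.
RETENTION_DAYS = {
    "sensor": 90,
    "web": 365,
    "transactional": 7 * 365,
}


def is_expired(category, ingested_on, today):
    """True when a data set is older than its category allows."""
    max_age = timedelta(days=RETENTION_DAYS[category])
    return today - ingested_on > max_age


def expired_datasets(catalog, today):
    """Return the names of catalog entries that are due for deletion."""
    return [
        name
        for name, category, ingested_on in catalog
        if is_expired(category, ingested_on, today)
    ]


# Made-up catalog: (name, category, ingestion date).
CATALOG = [
    ("sensor_feed", "sensor", date(2016, 1, 1)),
    ("clickstream", "web", date(2016, 1, 1)),
    ("orders", "transactional", date(2010, 1, 1)),
]

print(expired_datasets(CATALOG, today=date(2016, 6, 1)))
```

A scheduled job could feed the resulting list into whatever deletion or archiving process the storage layer provides.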
Data security
• Geographical concerns
– Where does your data reside?
Data security
• Authentication
– Who is accessing your data?
Data Storage
Areas of focus
How BIG is Big Data?
Storage Considerations
• IOPS are still important
– Big data still uses a lot of spinning disk
• Replication and Redundancy
– Eats a lot of disk space
• Build for failure
• Sometimes you have to go in-memory
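To see how much disk replication eats, a rough sizing calculation helps. The sketch below assumes a replication factor of 3 (the HDFS default) and a 25% free-space headroom. The headroom figure and the function name are assumptions for illustration, not numbers from the talk.

```python
def raw_capacity_tb(usable_tb, replication_factor=3, headroom=0.25):
    """Estimate the raw disk needed to store usable_tb of data.

    replication_factor: copies kept of every block (3 is the HDFS default).
    headroom: fraction of raw disk kept free (assumed value, tune to taste).
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    return usable_tb * replication_factor / (1 - headroom)


# 100 TB of data needs roughly four times that in raw disk under these assumptions.
print(raw_capacity_tb(100))
```

The same function makes it easy to compare the cost of a lower replication factor against the reduced tolerance for failure.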
Compute infrastructure
Areas of focus
Structured Reporting Versus Big Data/Science
Compute requirements
• Structured reporting systems run business processes
– Sized and static
– Under change control
– Business centric
Structured Reporting Versus Big Data/Science
Compute requirements
• Data science systems answer difficult questions irregularly
– Cloud or heavy use of virtualisation
– Developer centric
– Rapidly evolving
What you still need to remember
• Compute is cheap
• Scalability is critical
What you still need to remember
• Software definition for consistency
• Automate as much as possible
100 Hadoop nodes
122 GB RAM each = 12.2 TB RAM
Build time of 3 hrs
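The memory figure is simple multiplication, but it is worth having as a function when the node count changes. The sketch below reproduces the slide's arithmetic using decimal units (1 TB = 1000 GB). The function name is made up for illustration.

```python
def cluster_ram_tb(nodes, ram_per_node_gb):
    """Total cluster memory in decimal terabytes (1 TB = 1000 GB)."""
    return nodes * ram_per_node_gb / 1000


# The cluster on the slide: 100 nodes with 122 GB each.
print(cluster_ram_tb(100, 122))

# The same node size at the smaller 20-node scale.
print(cluster_ram_tb(20, 122))
```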
Use of scripted builds from VM to application
Disk definition
Network definition
Software install
Use of scripted builds from VM to application
• Deployment was consistent for each and every node of the cluster
– Hostnames defined the same way
– Configuration files created the same way
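Consistency of this kind comes from generating hostnames and configuration files from one function instead of typing them per node. The following is a minimal sketch. The naming scheme, the domain, and the configuration keys are hypothetical, not the actual scripts or real Hadoop properties.

```python
def node_hostnames(prefix, count, domain):
    """Generate zero-padded hostnames so every node follows one pattern."""
    width = len(str(count))
    return [f"{prefix}{i:0{width}d}.{domain}" for i in range(1, count + 1)]


def render_config(hostname, master, data_dir="/data"):
    """Render one node's config file; the keys are illustrative placeholders."""
    lines = [
        f"node.hostname={hostname}",
        f"cluster.master={master}",
        f"storage.dir={data_dir}",
    ]
    return "\n".join(lines) + "\n"


# Assumed naming scheme and domain, for illustration only.
hosts = node_hostnames("hadoop", 100, "example.internal")
configs = {host: render_config(host, master=hosts[0]) for host in hosts}

print(hosts[0], hosts[-1])
print(configs[hosts[4]])
```

Every node's file differs only in the values passed in, so a configuration error shows up on all nodes at once and is fixed in one place.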
Use of scripted builds from VM to application
• Faster deployment
– Automated build: 3 hrs to build and deploy 100 nodes
– Manual build: 800+ hrs to build and deploy 100 nodes
• Use of automated tools to detect failure and start a new node (Elastic Beanstalk)
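The detect-and-replace idea can be shown without any cloud service. The sketch below is a plain Python illustration of the pattern, not the Elastic Beanstalk API: the health check and the launch function are hypothetical callables supplied by the caller.

```python
def reconcile(nodes, is_healthy, launch_node):
    """Return a node list in which every unhealthy node has been replaced.

    is_healthy: callable taking a node name and returning True or False.
    launch_node: callable taking the failed node name and returning the
                 name of its replacement (both callables are assumptions).
    """
    result = []
    for node in nodes:
        if is_healthy(node):
            result.append(node)
        else:
            result.append(launch_node(node))
    return result


# Made-up cluster state: node-2 has failed.
health = {"node-1": True, "node-2": False, "node-3": True}

cluster = reconcile(
    list(health),
    is_healthy=health.get,
    launch_node=lambda failed: failed + "-replacement",
)
print(cluster)
```

Run on a schedule, a loop like this keeps the cluster at its intended size without anyone being paged to rebuild a node by hand.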
Use of scripted builds from VM to application
• Reusability of script
– Heavy use of parameters means it is adaptable
• Use of Git meant distributed development was handled easily
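Heavy use of parameters is what lets one script build a 20-node cluster today and a 100-node cluster tomorrow. The sketch below shows the idea with Python's standard `argparse` module. The flag names, defaults, and step descriptions are assumptions for illustration and do not reproduce the original build scripts.

```python
import argparse


def build_parser():
    """Command-line parameters that make the build script reusable."""
    parser = argparse.ArgumentParser(description="Plan a scripted cluster build")
    parser.add_argument("--nodes", type=int, default=20, help="number of nodes")
    parser.add_argument("--prefix", default="hadoop", help="hostname prefix")
    parser.add_argument("--disk-gb", type=int, default=500, help="disk per node")
    return parser


def build_plan(args):
    """Expand the parameters into ordered steps: disk, network, software."""
    steps = []
    for i in range(1, args.nodes + 1):
        host = f"{args.prefix}{i:03d}"
        steps.append(f"define disk {args.disk_gb}GB on {host}")
        steps.append(f"define network for {host}")
        steps.append(f"install software on {host}")
    return steps


# Example: the same script planning a small two-node build.
example_args = build_parser().parse_args(["--nodes", "2", "--prefix", "hdp"])
for step in build_plan(example_args):
    print(step)
```

Keeping a script like this in Git means every change to the parameters or steps is reviewed and recorded.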
Things to remember
• Remember the type of platform you are using
• Storage is cheap but not all storage is equal
• Scalability is critical
• Version control rocks
• Automate everything you can
• Value is in the data but not all data is valuable
• Data should not live forever
2009 H1N1 flu pandemic in the US
CDC had out-of-date data
Panama Papers – transient use case
Under Armour – constant data use case answering lots of different questions
Common Sense Finance institution – transient audit data use case
Natures Hope – pushing structured data into a data lake to provide better temperature control as part of their data lifecycle
Intel – using event streaming to drive manufacturing processes
We are literally drowning in data – data lakes
What data do we acquire? Sensor data, web data, social media, transactional data
What data is actually necessary, how long does it need to live, and what is its data life cycle?
What data do we need that we do not have access to?
How do we curate data for data lakes?
We have four developers and three journalists.
Timeline
Worked on the platform for 3 years across the various leaks
Processed the Panama Papers in around 12 months
How do we store data? Databases and files
Big data storage systems
HDFS
Cloud-based: S3 or Azure Storage
Databases – SQL and NoSQL
CSV
Hardware – massively scalable software defined infrastructures which expect failure
John broke my cluster
20 nodes – scaled to 100 nodes