4. Definitions I
Big Data
“data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.” (Oxford English Dictionary, 2014)
“an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” (Wikipedia, 2014)
“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” (McKinsey, 2011)
5. Definitions II
Big Data
Volume
Tiered storage / hub & spoke
Selective data retention
Statistical sampling
Redundancy elimination
Offload “cold” data
Outsourcing
Velocity
Operational data stores
Data caches
Point-to-point data routing
Balance data latency with decision cycles
Variety
Inconsistency resolution
XML-based “universal” translation
Application-aware EAI adapters
Data access middleware and ETLM
Distributed query management
Metadata management
(Laney, 2001)
23. Challenge: Data Storage
Big Data
Data silos
Structured. Well organized, but incomplete.
Data lakes
“Put it all in Hadoop or some big NoSQL database.”
Rice paddies
Structured & unstructured data in different places: “data lakes in silos.”
25. Data (-Storage, -Streams, -Analytics) Capabilities
Big Data
Source: Rob Winters, “Billions of Rows, Millions of Insights Right Now”
26. Challenges Overview
Big Data
Secure: data governance
Accessible: ease of use
Data driven: capturing new business, improving the business
Architecture: capturing concerns & plan, removing “barriers”
Technology solutions: Volume, Velocity & Variety
Adaptive: continuous change
Rapid development: tooling & process
Elastic scalable: application agnostic, adjustable resources
Abstract complexity: lots of self-service
Design for scalability: multi-vendor + exit plan
Check & verify: automation is critical, API based
28. Pets, Cattle & Chickens
Big Data
Pets: pussinboots
Build to specs & maintain
Traditional enterprise IT
Cattle: node72
Deploy, run, add/delete & update
Large-scale data processing
Chickens: application[…]
Containerized apps, lightweight & stateless
Elastic scalable applications
29. Pets
Big Data
The traditional server
Built to fulfil a particular task
Failing systems get healed ASAP
Single point(s)-of-failure
Periodic downtime is inevitable
Typically managed manually (sometimes assisted by scripts)
Domain of the sys-admin
30. Cattle
Big Data
Just another node in a network
No single point(s)-of-failure
Rolling upgrades
Downtime is a thing of the past
Failing systems get deleted
Managed by automation
Domain of the system (automation) engineers
45. RDBMS
Big Data
Relational DataBase Management System
- Data is structured into database tables, fields and records.
- Each table consists of rows (records).
- Each row consists of one or more fields.
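The table/row/field model above can be illustrated with a tiny in-memory SQLite example (table and column names are made up for the sketch):

```python
import sqlite3

# A minimal sketch of the RDBMS model: data lives in tables,
# and each table row consists of named fields (columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (artist TEXT, title TEXT, tempo REAL)")
conn.execute("INSERT INTO songs VALUES ('Shaggy', 'Criteria', 99.994)")

# A row comes back as a tuple of its fields.
row = conn.execute("SELECT artist, tempo FROM songs").fetchone()
print(row)  # ('Shaggy', 99.994)
```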
47. And the 3V’s?
Big Data
Volume: more data, less performance.
Velocity: OK when either read (OLAP) or write (OLTP), but not both.
Variety: nope, just strictly structured; keep (re-)modeling.
Other pros and cons of an RDBMS:
‘Normalization’ divides logically clustered data
Very good querying with SQL
Easy to understand and work with thanks to strict structures
49. Partitioning
Big Data
Partition (database)
(From Wikipedia, the free encyclopedia)
A partition is a division of a logical database or its constituent
elements into distinct independent parts. Database partitioning
is normally done for manageability, performance or availability
reasons.
50. Sharding
Big Data
Shard (database architecture)
(From Wikipedia, the free encyclopedia)
A database shard is a horizontal partition of data in a database
or search engine. Each individual partition is referred to as a
shard or database shard. Each shard is held on a separate
database server instance, to spread load.
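The horizontal-partitioning idea above can be sketched as hash-based shard routing; the shard count, key names and in-memory dicts standing in for database servers are all invented for the illustration:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Use a stable hash (not Python's per-process randomized hash())
    # so every client routes the same key to the same shard.
    digest = hashlib.md5(key.encode()).digest()
    return digest[0] % NUM_SHARDS

# Each dict stands in for a separate database server instance.
shards = {i: {} for i in range(NUM_SHARDS)}

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Rick"})
record = get("user:42")  # routed to the same shard it was written to
```

Spreading keys by hash balances load, but it is exactly what makes cross-shard queries hard: any lookup that is not by key must visit every shard.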
53. And the 3V’s?
Big Data
Volume: easy sharding, as everything is about the key.
Velocity: updating is not easy, as all data for a single key is usually overwritten.
Variety: you can go nuts; data is just a ‘blob’. It is all up to the user.
Other pros and cons of a key-value NoSQL solution:
Ideal for short-lived data
Often support for auto TTL (Time-To-Live)
No ‘querying’ or searching, just keys
Very fast as most data (only) lives in memory
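The auto-TTL behaviour mentioned above (as offered by key-value stores such as Redis) can be sketched with a toy in-memory store; the class and method names are made up:

```python
import time

class KVStore:
    """Toy key-value store with optional per-key TTL (Time-To-Live)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expiry = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key):
        value, expiry = self._data.get(key, (None, None))
        if expiry is not None and time.monotonic() > expiry:
            # Expired: drop it and behave as if the key never existed.
            del self._data[key]
            return None
        return value
```

Note that there is no way to search by value here: as the slide says, it is keys or nothing.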
55. And the 3V’s?
Big Data
Volume: key-based makes sharding easier.
Velocity: updating is possible (documents have a structure) but not easy.
Variety: you could go nuts, but… querying expects common elements in structures.
Other pros and cons of a NoSQL document store:
Freedom of structure for documents
Support for versioning of documents
Query performance really depends on (lack of) Variety
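The point that querying expects common elements can be shown with plain dicts standing in for documents (the field names and the `find` helper are invented for the sketch):

```python
# Documents are free-form, but queries only work on shared fields.
docs = [
    {"artist": "Shaggy", "title": "Criteria", "tempo": 99.994},
    {"artist": "Sister Hazel", "title": "Come Around", "tempo": 195.633},
    {"speaker": "Rob Winters"},  # no 'tempo' field at all
]

def find(field, predicate):
    # Documents lacking the field never match -- the flip side
    # of complete freedom of structure.
    return [d for d in docs if field in d and predicate(d[field])]

fast = find("tempo", lambda t: t > 150)
```

The third document is perfectly legal to store, yet invisible to any tempo query: that is how Variety eats query performance and coverage.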
56. 56
column(NoSQL)
Big Data
Associates keys with
sets (families) of
columns that provide
structure to the model
for optimal
distribution of data.
(Source: smalsrech.be)
57. And the 3V’s?
Big Data
Volume: column families help sharding (partitioning).
Velocity: updating is possible thanks to families.
Variety: some freedom within a column family, but the overall structure is fixed.
Other pros and cons of a NoSQL column store:
Column families can be used to exploit data locality
Complexity of designs
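The wide-column model above can be sketched as nested dicts: a row key maps to fixed, named column families, each holding an open-ended set of columns (row key, family and column names are made up):

```python
# row key -> column family -> column -> value
table = {}

def put(row_key, family, column, value):
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

put("song:1", "meta", "artist", "Shaggy")
put("song:1", "meta", "title", "Criteria")
put("song:1", "audio", "tempo", 99.994)

# Reading one family touches only that family's columns --
# the data-locality benefit the slide mentions.
meta = table["song:1"]["meta"]
print(meta)  # {'artist': 'Shaggy', 'title': 'Criteria'}
```

Within a family you may add columns freely (some Variety); the set of families itself is the fixed overall structure.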
59. And the 3V’s?
Big Data
Volume: just relations, so when/why/what goes in which shard?
Velocity: easy, just reroute relations.
Variety: you could go nuts, but… maintaining meaningful relationships requires some thought.
Other pros and cons of a NoSQL graph store:
Easy insight into adjacent ‘documents’
Fits very well for linked data (social)
Not an easy concept
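The “easy insight into adjacent documents” point can be sketched with an adjacency map and a breadth-first walk over the relations (the node names are invented):

```python
from collections import deque

# Nodes plus explicit relations; "adjacent documents" are one hop away.
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": [],
    "dave": [],
}

def reachable(start):
    """Everything connected to `start`, via a breadth-first traversal."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

This hop-by-hop access pattern is also why sharding is awkward: a traversal that crosses a shard boundary turns every hop into a network call.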
61. Elasticsearch
Big Data
Built on Apache Lucene (which also powers Apache Solr)
Primarily focussed on text searching (using Lucene)
But also a JSON document store
With auto-indexing based on document structure
And good querying options, including proximity and ranges
62. And the 3V’s?
Big Data
Volume: predefined indexes work well for sharding.
Velocity: not fast, but OK thanks to JSON structures; not intended for Velocity.
Variety: some, as long as documents are JSON and fit the predefined indexes.
Other pros and cons of Elasticsearch:
Getting the expected text-search result is not as easy as it looks
Support for versioning and TTL
Very powerful for human-language-related operations
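To make the querying options above concrete, this sketch only builds the JSON body of an Elasticsearch query combining a full-text match with a range filter (the index fields `title` and `tempo` are hypothetical; running it would mean POSTing the body to a real cluster's `_search` endpoint):

```python
import json

# Query DSL: a bool query with a full-text "must" clause
# and a non-scoring range filter.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "nature boy"}}],
            "filter": [{"range": {"tempo": {"gte": 90, "lte": 110}}}],
        }
    }
}

print(json.dumps(query, indent=2))
```

The `match` clause is analyzed text search (hence “not as easy as it looks”: analyzers, stemming and scoring all shape the result), while the `range` filter is a cheap structured predicate.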
68. Platform Approach
Big Data
Traditional: specification driven
Business users
· Seek a functional solution for a particular “job”
· Formulate the question
IT
· Defining requirements
· Technical feasibility
· Translation to technical design
· Build process
· Integrate
IT solution
· Built for specific “jobs”
Cloud: functionality driven
IT platform
· Value-driven
· Rich in functionality
Business
· Decides to use particular functionality, or not
· Explores its uses
· Subscription based
Ideal for (Big) Data Analytics!
69. Platform Perspective
Big Data
ElasticCloud
Big Data Analytics: Hadoop, Spark, NoSQL
Capacity resources: connectivity, compute, storage, backup
Services: deployment, monitoring, security
Platform: Cloudstudio (hosting & development)
70. Why Elastic Cloud?
Strategy Outlook 2015–2017
Challenges:
Development: short cycles; create, deploy, delete
Customer: expectation is instant; functionality driven, standardised & simplified (images, recipes & services)
Elasticity: Amazon-like cloud computing (“the ability to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner”)
76. Million Songs
Hadoop Fundamentals
-- Load the '#'-separated dataset and keep artist, title, location and tempo.
A = LOAD '/datasets/millionsongs-subset.csv' using PigStorage('#');
B = FOREACH A GENERATE $0 as artist, $1 as title, $8 as location, $49 as tempo;
-- Drop malformed rows where the tempo field contains a comma.
C = FILTER B BY NOT(tempo matches '.*,.*');
D = ORDER C BY tempo ASC;
dump D;
(Gian Marco,Te Mentiría,Peru,99.983999999999995)
(Rick Astley,Nature Boy,Newton-le-Willows, Merseyside, England,99.984999999999999)
(Joe McBride,All In,,99.989999999999995)
(Shaggy,Criteria,Kingston, Jamaica,99.994)
-- Average tempo per location, for locations with more than 10 songs
-- ($1 carries the grouped bag C along so AVG(C.tempo) can use it).
E = GROUP C BY location;
F = FOREACH E GENERATE group, COUNT(C) as count, $1;
G = FILTER F BY count > 10;
H = FOREACH G GENERATE group, AVG(C.tempo) as result;
I = ORDER H BY result ASC;
dump I;
(Boston, MA,135.11847619047617)
(Louisiana,135.23941666666667)
(Houston, TX,136.4752)
(Buenos Aires, Argentina,137.97354545454547)
(Gainesville, FL,144.8068235294118)
-- All (well-formed) songs from Gainesville, ordered by artist.
J = FILTER C BY (location matches '.*Gainesville.*');
K = ORDER J BY artist ASC;
dump K;
(Less Than Jake,Short On Ideas / One Last Cigarette,Gainesville, FL,170.41200000000001)
(Sister Hazel,Come Around (Acoustic),Gainesville, FL,195.63300000000001)