4. Definitions I
Big Data
“data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.” (Oxford English Dictionary, 2014)
“an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” (Wikipedia, 2014)
“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” (McKinsey, 2011)
5. Definitions II
Big Data
Volume
Tiered storage / hub & spoke
Selective data retention
Statistical sampling
Redundancy elimination
Offload “cold” data
Outsourcing
Velocity
Operational data stores
Data caches
Point-to-point data routing
Balance data latency with decision cycles
Variety
Inconsistency resolution
XML-based “universal” translation
Application-aware EAI adapters
Data access middleware and ETLM
Distributed query management
Metadata management
(Laney, 2001)
23. Challenge: Data Storage
Big Data
Data silos
Structured. Well organized, but incomplete.
Data lakes
“Put it all in Hadoop or some big NoSQL database.”
Rice paddies
Structured & unstructured data in different places: “data lakes in silos.”
25. Data (-Storage, -Streams, -Analytics) Capabilities
Big Data
Source: Rob Winters, “Billions of Rows, Millions of Insights Right Now”
26. Challenges Overview
Big Data
Secure: data governance
Accessible: ease of use
Data driven: capturing new business, improving the business
Architecture: capturing concerns & plan, removing “barriers”
Technology solutions: Volume, Velocity & Variety
Adaptive: continuous change
Rapid development: tooling & process
Elastic scalable: application agnostic, adjustable resources
Abstract complexity: lots of self-service
Design for scalability: multi-vendor + exit plan
Check & verify: automation is critical, API based
28. Pets, Cattle & Chickens
Big Data
Pets: pussinboots
Build to specs & maintain
Traditional enterprise IT
Cattle: node72
Deploy, run, add/delete & update
Large-scale data processing
Chickens: application[…]
Containerized apps, lightweight & stateless
Elastic scalable applications
29. Pets
Big Data
The traditional server
Built to fulfil a particular task
Failing systems get healed ASAP
Single point(s)-of-failure
Periodic downtime is inevitable
Typically managed manually (sometimes assisted by scripts)
Domain of the sys-admin
30. Cattle
Big Data
Just another node in a network
No single point(s)-of-failure
Rolling upgrades
Downtime is a thing of the past
Failing systems get deleted
Managed by automation
Domain of the system (automation) engineers
45. RDBMS
Big Data
Relational DataBase Management System
- Data is structured into database tables, fields and records.
- Each table consists of rows (records).
- Each row consists of one or more fields.
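The table/row/field model above can be illustrated with a tiny in-memory SQLite example (table and column names are made up for the sketch):

```python
import sqlite3

# A minimal sketch of the RDBMS model: data lives in tables,
# and each table row consists of named fields (columns).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (artist TEXT, title TEXT, tempo REAL)")
conn.execute("INSERT INTO songs VALUES ('Shaggy', 'Criteria', 99.994)")

# A row comes back as a tuple of its fields.
row = conn.execute("SELECT artist, tempo FROM songs").fetchone()
print(row)  # ('Shaggy', 99.994)
```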
47. And the 3V’s?
Big Data
Volume: more data, less performance.
Velocity: OK when either read (OLAP) or write (OLTP), but not both.
Variety: nope, just strictly structured; keep (re-)modeling.
Other pros and cons of an RDBMS:
‘Normalization’ divides logically clustered data
Very good querying with SQL
Easy to understand and work with thanks to strict structures
49. Partitioning
Big Data
Partition (database)
(From Wikipedia, the free encyclopedia)
A partition is a division of a logical database or its constituent
elements into distinct independent parts. Database partitioning
is normally done for manageability, performance or availability
reasons.
50. Sharding
Big Data
Shard (database architecture)
(From Wikipedia, the free encyclopedia)
A database shard is a horizontal partition of data in a database
or search engine. Each individual partition is referred to as a
shard or database shard. Each shard is held on a separate
database server instance, to spread load.
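The horizontal-partitioning idea above can be sketched as hash-based shard routing; the shard count, key names and in-memory dicts standing in for database servers are all invented for the illustration:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Use a stable hash (not Python's per-process randomized hash())
    # so every client routes the same key to the same shard.
    digest = hashlib.md5(key.encode()).digest()
    return digest[0] % NUM_SHARDS

# Each dict stands in for a separate database server instance.
shards = {i: {} for i in range(NUM_SHARDS)}

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Rick"})
record = get("user:42")  # routed to the same shard it was written to
```

Spreading keys by hash balances load, but it is exactly what makes cross-shard queries hard: any lookup that is not by key must visit every shard.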
53. And the 3V’s?
Big Data
Volume: easy sharding, as everything is about the key.
Velocity: updating is not easy, as all data for a single key is usually overwritten.
Variety: you can go nuts; data is just a ‘blob’. It is all up to the user.
Other pros and cons of a key-value NoSQL solution:
Ideal for short-lived data
Often support for auto TTL (Time-To-Live)
No ‘querying’ or searching, just keys
Very fast as most data (only) lives in memory
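The auto-TTL behaviour mentioned above (as offered by key-value stores such as Redis) can be sketched with a toy in-memory store; the class and method names are made up:

```python
import time

class KVStore:
    """Toy key-value store with optional per-key TTL (Time-To-Live)."""

    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expiry = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key):
        value, expiry = self._data.get(key, (None, None))
        if expiry is not None and time.monotonic() > expiry:
            # Expired: drop it and behave as if the key never existed.
            del self._data[key]
            return None
        return value
```

Note that there is no way to search by value here: as the slide says, it is keys or nothing.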
55. And the 3V’s?
Big Data
Volume: key-based makes sharding easier.
Velocity: updating is possible (documents have a structure) but not easy.
Variety: you could go nuts, but… querying expects common elements in structures.
Other pros and cons of a NoSQL document store:
Freedom of structure for documents
Support for versioning of documents
Query performance really depends on (lack of) Variety
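The point that querying expects common elements can be shown with plain dicts standing in for documents (the field names and the `find` helper are invented for the sketch):

```python
# Documents are free-form, but queries only work on shared fields.
docs = [
    {"artist": "Shaggy", "title": "Criteria", "tempo": 99.994},
    {"artist": "Sister Hazel", "title": "Come Around", "tempo": 195.633},
    {"speaker": "Rob Winters"},  # no 'tempo' field at all
]

def find(field, predicate):
    # Documents lacking the field never match -- the flip side
    # of complete freedom of structure.
    return [d for d in docs if field in d and predicate(d[field])]

fast = find("tempo", lambda t: t > 150)
```

The third document is perfectly legal to store, yet invisible to any tempo query: that is how Variety eats query performance and coverage.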
56. 56
column(NoSQL)
Big Data
Associates keys with
sets (families) of
columns that provide
structure to the model
for optimal
distribution of data.
(Source: smalsrech.be)
57. And the 3V’s?
Big Data
Volume: column families help sharding (partitioning).
Velocity: updating is possible thanks to families.
Variety: some freedom within a column family, but the overall structure is fixed.
Other pros and cons of a NoSQL column store:
Column families can be used to exploit data locality
Complexity of designs
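The wide-column model above can be sketched as nested dicts: a row key maps to fixed, named column families, each holding an open-ended set of columns (row key, family and column names are made up):

```python
# row key -> column family -> column -> value
table = {}

def put(row_key, family, column, value):
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

put("song:1", "meta", "artist", "Shaggy")
put("song:1", "meta", "title", "Criteria")
put("song:1", "audio", "tempo", 99.994)

# Reading one family touches only that family's columns --
# the data-locality benefit the slide mentions.
meta = table["song:1"]["meta"]
print(meta)  # {'artist': 'Shaggy', 'title': 'Criteria'}
```

Within a family you may add columns freely (some Variety); the set of families itself is the fixed overall structure.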
59. And the 3V’s?
Big Data
Volume: just relations, so when/why/what goes in which shard?
Velocity: easy, just reroute relations.
Variety: you could go nuts, but… maintaining meaningful relationships requires some thought.
Other pros and cons of a NoSQL graph store:
Easy insight into adjacent ‘documents’
Fits very well for linked data (social)
Not an easy concept
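The “easy insight into adjacent documents” point can be sketched with an adjacency map and a breadth-first walk over the relations (the node names are invented):

```python
from collections import deque

# Nodes plus explicit relations; "adjacent documents" are one hop away.
edges = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": [],
    "dave": [],
}

def reachable(start):
    """Everything connected to `start`, via a breadth-first traversal."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

This hop-by-hop access pattern is also why sharding is awkward: a traversal that crosses a shard boundary turns every hop into a network call.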
61. Elasticsearch
Big Data
Built on Apache Lucene (which also powers Apache Solr)
Primarily focussed on text searching (using Lucene)
But also a JSON document store
With auto-indexing based on document structure
And good querying options, including proximity and ranges
62. And the 3V’s?
Big Data
Volume: predefined indexes work well for sharding.
Velocity: not fast, but OK thanks to JSON structures; not intended for Velocity.
Variety: some, as long as documents are JSON and fit the predefined indexes.
Other pros and cons of Elasticsearch:
Getting the expected text-search result is not as easy as it looks
Support for versioning and TTL
Very powerful for human-language-related operations
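To make the querying options above concrete, this sketch only builds the JSON body of an Elasticsearch query combining a full-text match with a range filter (the index fields `title` and `tempo` are hypothetical; running it would mean POSTing the body to a real cluster's `_search` endpoint):

```python
import json

# Query DSL: a bool query with a full-text "must" clause
# and a non-scoring range filter.
query = {
    "query": {
        "bool": {
            "must": [{"match": {"title": "nature boy"}}],
            "filter": [{"range": {"tempo": {"gte": 90, "lte": 110}}}],
        }
    }
}

print(json.dumps(query, indent=2))
```

The `match` clause is analyzed text search (hence “not as easy as it looks”: analyzers, stemming and scoring all shape the result), while the `range` filter is a cheap structured predicate.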
68. Platform Approach
Big Data
Traditional: specification driven
Business users
· Seek a functional solution for a particular “job”
· Formulate the question
IT
· Defining requirements
· Technical feasibility
· Translation to technical design
· Build process
· Integrate
IT solution
· Built for specific “jobs”
Cloud: functionality driven
IT platform
· Value-driven
· Rich in functionality
Business
· Decides to use particular functionality, or not
· Explores its uses
· Subscription based
Ideal for (Big) Data Analytics!
69. Platform Perspective
Big Data
ElasticCloud
Big Data Analytics: Hadoop, Spark, NoSQL
Capacity resources: connectivity, compute, storage, backup
Services: deployment, monitoring, security
Platform: Cloudstudio (hosting & development)
70. Why Elastic Cloud?
Strategy Outlook 2015–2017
Challenges:
Development: short cycles; create, deploy, delete
Customer: expectation is instant; functionality driven, standardised & simplified (images, recipes & services)
Elasticity: Amazon-like cloud computing (“the ability to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner”)
76. Million Songs
Hadoop Fundamentals
-- Load the '#'-separated dataset and keep artist, title, location and tempo.
A = LOAD '/datasets/millionsongs-subset.csv' using PigStorage('#');
B = FOREACH A GENERATE $0 as artist, $1 as title, $8 as location, $49 as tempo;
-- Drop malformed rows where the tempo field contains a comma.
C = FILTER B BY NOT(tempo matches '.*,.*');
D = ORDER C BY tempo ASC;
dump D;
(Gian Marco,Te Mentiría,Peru,99.983999999999995)
(Rick Astley,Nature Boy,Newton-le-Willows, Merseyside, England,99.984999999999999)
(Joe McBride,All In,,99.989999999999995)
(Shaggy,Criteria,Kingston, Jamaica,99.994)
-- Average tempo per location, for locations with more than 10 songs
-- ($1 carries the grouped bag C along so AVG(C.tempo) can use it).
E = GROUP C BY location;
F = FOREACH E GENERATE group, COUNT(C) as count, $1;
G = FILTER F BY count > 10;
H = FOREACH G GENERATE group, AVG(C.tempo) as result;
I = ORDER H BY result ASC;
dump I;
(Boston, MA,135.11847619047617)
(Louisiana,135.23941666666667)
(Houston, TX,136.4752)
(Buenos Aires, Argentina,137.97354545454547)
(Gainesville, FL,144.8068235294118)
-- All (well-formed) songs from Gainesville, ordered by artist.
J = FILTER C BY (location matches '.*Gainesville.*');
K = ORDER J BY artist ASC;
dump K;
(Less Than Jake,Short On Ideas / One Last Cigarette,Gainesville, FL,170.41200000000001)
(Sister Hazel,Come Around (Acoustic),Gainesville, FL,195.63300000000001)