Big Data
Behind the Scenes
August 27th 2015
Anthony Potappel | (Big) Data Engineer
Patrick Beitsma | (Big) Data Engineer
2
Program
10.00 – 10.30  Introduction & Expectations
10.30 – 11.00  What is Big Data?; Big Data & IT; Challenges; Automation
11.00 – 11.10  Coffee break
11.10 – 11.40  (Big) Data Technologies: Databases: (No)SQL
11.40 – 11.50  Coffee break
11.50 – 12.20  Hadoop (/Spark) Platform; Examples & Demos
12.20 – 13.00  Lunch
13.00 – 14.00  Datacenter tour
3
What is Big Data?
4
Definitions I
“data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.” (Oxford English Dictionary, 2014)
“an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” (Wikipedia, 2014)
“datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze” (McKinsey, 2011)
5
Definitions II (Laney, 2001)
Volume
• Tiered storage/hub & spoke
• Selective data retention
• Statistical sampling
• Redundancy elimination
• Offload “cold” data
• Outsourcing
Velocity
• Operational data stores
• Data caches
• Point-to-point data routing
• Balance data latency with decision cycles
Variety
• Inconsistency resolution
• XML-based “universal” translation
• Application-aware EAI adapters
• Data access middleware and ETLM
• Distributed query management
• Metadata management
6
Data sources
Big Data
Source: Hortonworks & Teradata/ Vesselhead.com
7
The Data Revolution
Source: https://www.youtube.com/watch?v=LrNlZ7-SMPk
8
Volume
Big Data
Source: EMC/ IDC
40-45% Yearly growth in data volume
9
Connected devices I
Big Data
Source: HBR/ BI Intelligence
10
Connected devices II
Big Data
Source: http://chipestimate.com
Source: http://motherboard.vice.com
11
Big Data & IT
12
Big Data
Advanced Analytics
Source: Gartner/ Data Science Central
13
Social, Mobile, Analytics & Cloud (SMAC)
Big Data
Source: Cognizant: Don’t get SMACked
14
15
Business Intelligence & Big Data
16
A new approach I
Big Data
Source: Datasciencecentral.com
17
Business Intelligence vs. Data Science
Big Data
Source: EMC
18
Improving Return-on-Investment
Big Data
Source: http://www.threedeedigital.com/big-data-analytics-customer-acquisition-strategy/
19
Mapping the Challenges
20
Technical Requirements: Big Data Analytics Platform
Volume
• More data = more information
• Large-scale data processing
• Linearly scalable
• Broadband network
• High I/O throughput
Variety
• Combining sources -> (heavy) ETL
• Multiple databases/stores
• Modern application platform
• Expanded ‘toolkit’
• Modular, configurable
• Rapid platform development
• Storage options
Velocity
• Real-time data processing
• High-performance compute
• Excellent network connectivity
• Compute- and memory-intensive
21
Challenge: complexity
Big Data
Source: StackIQ
22
Challenge: Lots of applications
Big Data
Source: Datafloq
23
Challenge: Data Storage
Datasilos: Structured. Well organized, but incomplete.
Datalakes: “Put it all in Hadoop or some big NoSQL database”
RicePaddies: Structured & unstructured data in different places. “Datalakes in silos”
24
Challenge: Streaming, The Apps, Data & Analytics cycle
Big Data
25
Data (-Storage, -Streams, -Analytics) Capabilities
Big Data
Source: Rob Winters “ Billions of Rows, Millions of Insights Right now”
26
Challenges Overview
Big Data
Secure
Data Governance
Accessible
Ease of use
Data Driven
Capturing new business
Improving the business
Architecture
Capturing concerns & plan
Removing
“Barriers”
Technology
Solutions
Volume, Velocity & Variety
Adaptive
Continuous change
Rapid
Development
Tooling & Process
Elastic scalable
Application agnostic
Adjustable resources
Abstract complexity
Lots of self-service
Design for scalability
Multi vendor + exit-plan
Check & verify
Automation is critical
API based
27
Automation at Massive Scale
28
Pets, Cattle & Chicken
Pets: pussinboots
• Built to specs & maintained
• Traditional enterprise IT
Cattle: node72
• Deploy, run, add/delete & update
• Large-scale data processing
Chicken: application[…]
• Containerized apps
• Lightweight & stateless
• Elastically scalable applications
29
Pets
The traditional server
• Built to fulfil a particular task
• Failing systems get healed ASAP
• Single point(s) of failure
• Periodic downtime inevitable
• Typically managed manually (sometimes assisted by scripts)
• Domain of the sys-admin
30
Cattle
Just another node in a network
• No single point(s) of failure
• Rolling upgrades
• Downtime a thing of the past
• Failing systems get deleted
• Managed by automation
• Domain of the system (automation) engineers
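"Failing systems get deleted" and "managed by automation" can be pictured as a small reconcile loop. The sketch below is only an illustration under stated assumptions: the node names, the health check, and the `new_node` factory are hypothetical and do not represent any particular orchestration tool.

```python
import itertools

def reconcile(nodes, is_healthy, desired_count, new_node):
    """Drop failing nodes and add fresh ones until desired_count is reached."""
    # Failing nodes are deleted, not repaired.
    healthy = [n for n in nodes if is_healthy(n)]
    # Replacements are interchangeable: any new node will do.
    while len(healthy) < desired_count:
        healthy.append(new_node())
    return healthy

counter = itertools.count(73)  # hypothetical numbering for new nodes
cluster = ["node70", "node71", "node72"]
cluster = reconcile(
    cluster,
    is_healthy=lambda name: name != "node71",  # pretend node71 has failed
    desired_count=3,
    new_node=lambda: f"node{next(counter)}",
)
print(cluster)
```

The point of the design is that nothing in the loop cares which node failed; capacity is restored by count, not by identity.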
31
Chicken
Applications only
• Inherits characteristics from cattle
• Added abstraction
• Added efficiency
• However: N/A for the data platform itself
• Domain of the developer
32
Automation
Big data
33
Traditional
Big data
34
Agile
Big data
35
Continuous (a.k.a. Rapid) Development
Big Data
Rebuild
Playscripts
Run
Platform
Improve
36
Flexible resources
37
Lambda: a need for speed
Big Data
Source: YMC.ch
38
Big Data: Platform Layout(s)
Big Data
39
Availability Zones
40
Reliability over unreliable infrastructure
Big Data
Source: The Boston Consulting Group
Twisted pair Coaxial Fiber Spectrum
Ethernet PPP CDMA IEEE 802
IP
TCP UDP
HTTP SMTP RTP
Browser E-mail VOIP client
Innovation
Experimentation
Personalization
Scale
Utilization
“End-to-end Principle”
41
Cloud stacks
Big Data
42
(Big) Data technologies
43
Storing Data
NoSQL, Traditional databases
(Source: tomitspro.com)
44
(Source: datasciencecentral.com)
45
RDBMS
Relational Database Management System
- Data is structured in database tables, fields and records.
- Each table consists of rows (records).
- Each row consists of one or more fields.
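The table/row/field structure can be made concrete with a few lines of code. This is a minimal sketch using Python's built-in sqlite3 module; the `songs` table and its columns are assumptions chosen for illustration, with values borrowed from the Millionsongs demo output later in the deck.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# A table defines the fields (columns) that every record (row) has.
conn.execute("CREATE TABLE songs (artist TEXT, title TEXT, tempo REAL)")

# Each inserted tuple becomes one row; each value fills one field.
conn.executemany(
    "INSERT INTO songs (artist, title, tempo) VALUES (?, ?, ?)",
    [
        ("Sister Hazel", "Come Around (Acoustic)", 195.633),
        ("Less Than Jake", "Short On Ideas / One Last Cigarette", 170.412),
    ],
)

# SQL selects rows by the values in their fields.
rows = conn.execute(
    "SELECT artist, tempo FROM songs WHERE tempo > ? ORDER BY tempo", (180,)
).fetchall()
print(rows)
```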
46
Schema
Big Data
47
And the 3V’s?
Volume: more data, less performance.
Velocity: OK when the workload is either read-heavy (OLAP) or write-heavy (OLTP), but not both.
Variety: nope, just strictly structured. Keep (re-)modeling.
Other pros and cons of an RDBMS:
• ‘Normalization’ divides logically clustered data
• Very good querying with SQL
• Easy to understand and work with thanks to strict structures
48
49
Partitioning
Big Data
Partition (database)
(From Wikipedia, the free encyclopedia)
A partition is a division of a logical database or its constituent
elements into distinct independent parts. Database partitioning
is normally done for manageability, performance or availability
reasons.
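As an illustration of the definition above, the sketch below divides records into independent parts by year. The partitioning rule and the sample records are hypothetical; real databases offer several schemes (range, list, hash), and this shows only the general idea.

```python
from collections import defaultdict

def partition_key(record):
    # Hypothetical rule: one partition per calendar year of the record.
    return record["date"][:4]

records = [
    {"id": 1, "date": "2014-11-03"},
    {"id": 2, "date": "2015-02-17"},
    {"id": 3, "date": "2015-08-27"},
]

partitions = defaultdict(list)
for record in records:
    partitions[partition_key(record)].append(record)

# A query for 2015 only needs to scan the "2015" partition.
ids_2015 = [r["id"] for r in partitions["2015"]]
print(ids_2015)
```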
50
Sharding
Big Data
Shard (database architecture)
(From Wikipedia, the free encyclopedia)
A database shard is a horizontal partition of data in a database
or search engine. Each individual partition is referred to as a
shard or database shard. Each shard is held on a separate
database server instance, to spread load.
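One common way to decide which shard holds a record is to hash its key. The sketch below assumes three hypothetical server names and a simple modulo scheme; it is not taken from the deck, and production systems often prefer consistent hashing so that adding a shard moves fewer keys.

```python
import hashlib

SHARDS = ["db-0", "db-1", "db-2"]  # hypothetical server instances

def shard_for(key: str) -> str:
    # A stable hash (unlike Python's built-in hash()) maps the same key
    # to the same shard on every run and every machine.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

placement = {key: shard_for(key) for key in ["user:1", "user:2", "user:3", "user:4"]}
print(placement)
```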
51
52
Key-value (NoSQL)
Collection of key/value pairs where the key is a unique identifier and the value is an arbitrary piece of data.
(Source: smalsrech.be)
53
And the 3V’s?
Volume: easy sharding, as everything is about the key.
Velocity: updating is not easy, as all data for a single key is usually overwritten.
Variety: you can go nuts, data is just a ‘blob’. It is all up to the user.
Other pros and cons of a key-value NoSQL solution:
• Ideal for short-lived data
• Often support for automatic TTL (Time-To-Live)
• Very fast, as most data lives (only) in memory
• No ‘querying’ or searching, just keys
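The properties listed above (opaque values, whole-value overwrites, TTL, lookup by key only) fit in a very small sketch. The class below is a toy in-memory store written for illustration; its name and methods are assumptions and do not mirror the API of any specific product.

```python
import time

class KeyValueStore:
    """Toy in-memory key-value store with an optional per-key TTL in seconds."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expiry time or None)
        self._clock = clock  # injectable so expiry can be demonstrated

    def put(self, key, value, ttl=None):
        expires = self._clock() + ttl if ttl is not None else None
        # The whole value is overwritten; there is no partial update.
        self._data[key] = (value, expires)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires = item
        if expires is not None and self._clock() >= expires:
            del self._data[key]  # lazily drop expired entries
            return default
        return value

now = [0.0]  # fake clock so the example does not need to sleep
store = KeyValueStore(clock=lambda: now[0])
store.put("session:42", b"opaque blob", ttl=30)
first = store.get("session:42")
now[0] = 31.0
second = store.get("session:42")
print(first, second)
```

Note that there is no way to ask "which sessions belong to user X": the only access path is the key.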
54
Document (NoSQL)
Similar to a key/value store, only the values are documents with an implicit schematic structure.
(Source: smalsrech.be)
55
And the 3V’s?
Volume: key-based, which makes sharding easier.
Velocity: updating is possible (documents have a structure) but not easy.
Variety: you could go nuts, but… querying expects common elements in structures.
Other pros and cons of a NoSQL document store:
• Freedom of structure for documents
• Support for versioning of documents
• Query performance really depends on (lack of) variety
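Why querying "expects common elements" can be seen in a small sketch. The documents below reuse artist and title values from the Millionsongs demo output; the nested `location` layout and the `find` helper are assumptions made for illustration only.

```python
documents = {
    "song:1": {"artist": "Shaggy", "title": "Criteria",
               "location": {"city": "Kingston", "country": "Jamaica"}},
    "song:2": {"artist": "Gian Marco", "title": "Te Mentiría",
               "location": {"country": "Peru"}},
    "song:3": {"artist": "Joe McBride", "title": "All In"},  # no location at all
}

def find(docs, path, expected):
    """Return keys of documents whose nested field at `path` equals `expected`."""
    matches = []
    for key, doc in docs.items():
        node = doc
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
            if node is None:
                break  # this document lacks the field, so it cannot match
        if node == expected:
            matches.append(key)
    return matches

peru = find(documents, "location.country", "Peru")
print(peru)
```

Documents are free to omit fields, but a document without `location.country` simply never appears in results for that query, which is the trade-off the slide points at.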
56
Column (NoSQL)
Associates keys with sets (families) of columns that provide structure to the model for optimal distribution of data.
(Source: smalsrech.be)
57
And the 3V’s?
Volume: column families help sharding (partitioning).
Velocity: updating is possible thanks to families.
Variety: some freedom within a column family, but the overall structure is fixed.
Other pros and cons of a NoSQL column store:
• Column families can be used to exploit data locality
• Complexity of designs
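A rough way to picture column families: the set of families is fixed up front, columns inside a family are free, and each family is stored together. The class, family names, and row key below are hypothetical and only sketch the idea; they do not reproduce the data model of any specific column store.

```python
from collections import defaultdict

class ColumnFamilyTable:
    """Rows addressed by key; columns grouped into a fixed set of families."""

    def __init__(self, families):
        self.families = set(families)
        # family -> row key -> {column: value}; each family is kept together,
        # which is what allows data locality to be exploited.
        self._store = {f: defaultdict(dict) for f in self.families}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self._store[family][row_key][column] = value  # updates a single cell

    def get_family(self, row_key, family):
        return dict(self._store[family].get(row_key, {}))

table = ColumnFamilyTable(["info", "audio"])
table.put("song:1", "info", "artist", "Rick Astley")
table.put("song:1", "info", "title", "Nature Boy")
table.put("song:1", "audio", "tempo", 99.985)
audio = table.get_family("song:1", "audio")
print(audio)
```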
58
Graph (NoSQL)
Emphasizes the relationships between items through a flexible ‘web’ as opposed to a rigid structure.
(Source: smalsrech.be)
59
And the 3V’s?
Volume: just relations, so when/why/what in which shard?
Velocity: easy, just reroute relations.
Variety: you could go nuts, but… maintaining meaningful relationships requires some thought.
Other pros and cons of a NoSQL graph store:
• Easy insight into adjacent ‘documents’
• Fits very well for linked data (social)
• Not an easy concept
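"Just reroute relations" and "easy insight into adjacent documents" can be sketched with an adjacency structure. The `Graph` class and the names in the example are made up for illustration and do not correspond to a specific graph database.

```python
from collections import defaultdict

class Graph:
    def __init__(self):
        self._edges = defaultdict(set)  # node -> {(relation, other node)}

    def relate(self, source, relation, target):
        self._edges[source].add((relation, target))

    def unrelate(self, source, relation, target):
        self._edges[source].discard((relation, target))

    def neighbours(self, node, relation=None):
        # Adjacent items are one lookup away; no join is needed.
        return sorted(t for r, t in self._edges.get(node, ())
                      if relation is None or r == relation)

g = Graph()
g.relate("alice", "follows", "bob")
g.relate("alice", "follows", "carol")
g.relate("bob", "follows", "carol")
# Rerouting a relation is a delete plus an insert; no schema change is involved.
g.unrelate("alice", "follows", "bob")
g.relate("alice", "follows", "dave")
followed = g.neighbours("alice", "follows")
print(followed)
```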
60
What about Elasticsearch?
61
• Built on Apache Lucene (the search library that also underlies Apache Solr)
• Primarily focused on text searching (using Lucene)
• But also a JSON document store
• With automatic indexing based on document structure
• And good querying options, including proximity and ranges
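Text search engines of this kind are generally built around an inverted index: a map from terms to the documents containing them. The sketch below is a deliberately tiny stand-in, not Lucene's or Elasticsearch's actual implementation; the class, tokenizer, and sample documents are assumptions for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    return re.findall(r"\w+", text.lower())

class InvertedIndex:
    def __init__(self):
        self._postings = defaultdict(set)  # (field, term) -> document ids

    def index(self, doc_id, document):
        # Every string field is indexed automatically, driven by the
        # structure of the document itself.
        for field, value in document.items():
            if isinstance(value, str):
                for term in tokenize(value):
                    self._postings[(field, term)].add(doc_id)

    def search(self, field, query):
        # AND semantics: a hit must contain every query term.
        term_sets = [self._postings.get((field, t), set()) for t in tokenize(query)]
        return sorted(set.intersection(*term_sets)) if term_sets else []

index = InvertedIndex()
index.index(1, {"artist": "Rick Astley", "title": "Nature Boy", "tempo": 99.985})
index.index(2, {"artist": "Sister Hazel", "title": "Come Around (Acoustic)"})
hits = index.search("title", "come around")
print(hits)
```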
62
And the 3V’s?
Volume: predefined indexes work well for sharding.
Velocity: not fast, but OK thanks to JSON structures. Not intended for velocity.
Variety: some, as long as documents are JSON. Predefined indexes.
Other pros and cons of Elasticsearch:
• Getting the expected text-search result is not as easy as it looks
• Support for versioning and TTL
• Very powerful for human-language-related operations
63
So what should we use?
64
(Source: martinfowler.com)
65
And Facebook?
Big Data
66
It is all about knowing your (big) data.
67
Platforms
Big Data
68
Platform Approach
Traditional (specification driven): IT solution
Business users
· Seek a functional solution for a particular “job”
· Formulate the question
IT
· Defining requirements
· Technical feasibility
· Translation to technical design
· Build process
· Integrate
Cloud (functionality driven): IT platform
IT platform
· Built for specific “jobs”
· Value-driven
· Rich in functionality
Business users
· Decide to use particular functionality, or not
· Explore its uses
· Subscription based
Ideal for (Big) Data Analytics!
69
Platform perspective
ElasticCloud
Platforms: Big Data Analytics (Hadoop, Spark, NoSQL); Cloudstudio (Hosting & Development); Platform ...
Services: Deployment service, Monitoring service, Security service
Capacity resources: Connectivity resource, Compute resource, Storage resource, Backup resource
70
Strategy Outlook 2015 – 2017
Why Elastic Cloud?
Challenges:
• Development: short cycles; create, deploy, delete
• Customer: expectation is instant
• Functionality driven, standardised & simplified (images, recipes & services)
• Elasticity: Amazon-like cloud computing (“ability to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner”)
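The quoted definition of elasticity (provisioning and de-provisioning in response to workload) can be sketched as a scaling rule. The function below is a hypothetical illustration: the 60% target, the node limits, and the use of CPU utilisation as the only signal are all assumptions, not something the deck specifies.

```python
def desired_nodes(current_nodes, cpu_percent, target_percent=60,
                  min_nodes=1, max_nodes=20):
    """Return the node count that brings average CPU use near the target."""
    if current_nodes <= 0:
        return min_nodes
    # Total demand stays the same; spread it over more or fewer nodes.
    # -(-a // b) is integer ceiling division.
    needed = -(-current_nodes * cpu_percent // target_percent)
    return max(min_nodes, min(max_nodes, needed))

scale_up = desired_nodes(4, 90)     # busy cluster: provision more nodes
scale_down = desired_nodes(10, 20)  # idle cluster: de-provision nodes
print(scale_up, scale_down)
```

Running such a rule on a timer, with the result fed to a deployment service, is one simple way to get the "autonomic" behaviour the definition asks for.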
71
Big Data Analytics Platform: Node layout
Big Data
72
Live
Examples
Big Data
73
Deployment
Hadoop Fundamentals
74
PIG version
Hadoop Fundamentals
75
Spark version
Hadoop Fundamentals
76
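Millionsongs
-- Load the '#'-separated subset and pick four fields by position.
A = LOAD '/datasets/millionsongs-subset.csv' using PigStorage('#');
B = FOREACH A GENERATE $0 as artist, $1 as title, $8 as location, $49 as tempo;
-- Drop rows whose tempo field contains a comma.
C = FILTER B BY NOT(tempo matches '.*,.*');
-- Note: tempo has no declared type here, so this ordering appears to compare
-- it as text rather than as a number (the excerpt ends near 99.99 although
-- higher tempos occur in later results).
D = ORDER C BY tempo ASC;
dump D;
-- Output (excerpt):
-- (Gian Marco,Te Mentiría,Peru,99.983999999999995)
-- (Rick Astley,Nature Boy,Newton-le-Willows, Merseyside, England,99.984999999999999)
-- (Joe McBride,All In,,99.989999999999995)
-- (Shaggy,Criteria,Kingston, Jamaica,99.994)

-- Average tempo per location, for locations with more than 10 songs.
E = GROUP C BY location;
F = FOREACH E GENERATE group, COUNT(C) as count, $1;
G = FILTER F BY count > 10;
H = FOREACH G GENERATE group, AVG(C.tempo) as result;
I = ORDER H BY result ASC;
dump I;
-- Output (excerpt):
-- (Boston, MA,135.11847619047617)
-- (Louisiana,135.23941666666667)
-- (Houston, TX,136.4752)
-- (Buenos Aires, Argentina,137.97354545454547)
-- (Gainesville, FL,144.8068235294118)

-- Songs from Gainesville, ordered by artist.
-- The alias is K here; the slide reused H, which shadows the relation above.
J = FILTER C BY (location matches '.*Gainesville.*');
K = ORDER J BY artist ASC;
dump K;
-- Output (excerpt):
-- (Less Than Jake,Short On Ideas / One Last Cigarette,Gainesville, FL,170.41200000000001)
-- (Sister Hazel,Come Around (Acoustic),Gainesville, FL,195.63300000000001)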
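For readers unfamiliar with Pig, the GROUP / COUNT / AVG / ORDER steps can be mirrored in plain Python. This is only a sketch of the same logic on three rows copied from the output above (tempos rounded); it is not the Spark version from the deck, and the `min_count` parameter is an assumption standing in for the `count > 10` filter.

```python
from collections import defaultdict
from statistics import mean

# (artist, title, location, tempo) rows taken from the output on this slide.
songs = [
    ("Less Than Jake", "Short On Ideas / One Last Cigarette", "Gainesville, FL", 170.412),
    ("Sister Hazel", "Come Around (Acoustic)", "Gainesville, FL", 195.633),
    ("Shaggy", "Criteria", "Kingston, Jamaica", 99.994),
]

def average_tempo_by_location(rows, min_count=10):
    by_location = defaultdict(list)            # GROUP C BY location
    for _artist, _title, location, tempo in rows:
        by_location[location].append(tempo)
    averages = {
        location: mean(tempos)                 # AVG(C.tempo)
        for location, tempos in by_location.items()
        if len(tempos) > min_count             # FILTER F BY count > ...
    }
    return sorted(averages.items(), key=lambda item: item[1])  # ORDER ... ASC

# With only three sample rows, lower the threshold so one group survives.
result = average_tempo_by_location(songs, min_count=1)
print(result)
```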
77
Recommendation engine
Hadoop Fundamentals
78
Finally
The road ahead
• Improve continuously
• Add functionality
• “Building Platforms & Components”
• Help wanted!

BigData Behind-the-Scenes~20150827

  • 1. Big Data Behind the Scenes August 27th 2015 Anthony Potappel | (Big) Data Engineer Patrick Beitsma | (Big) Data Engineer
  • 2. 2 10.00 – 10.30 Introduction & Expectations 10.30 – 11.00 What is Big Data? Big Data & IT Challenges Automation 11.00 – 11.10 Coffee break 11.10 – 11.40 (Big) Data Technologies: Databases: (No)SQL Program Big Data 11.40 – 11.50 Coffee break 11.50 – 12.20 Hadoop (/Spark) Platform Examples & Demo’s 12.20 – 13.00 Lunch 13.00 – 14.00 Datacenter tour
  • 4. 4 “data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.” (Oxford English Dictionary, 2014) “an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.” (Wikipedia, 2014). “datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze,” (McKinsey, 2011) Definitions I Big Data
  • 5. 5 Volume Tiered storage/hub & spoke Selective data retention Statistical sampling Redundancy elimination Offload “cold” data Outsourcing Velocity Operational data stores Data caches Point-to-point data routing Balance data latency with decision cycles Definitions II Big Data Variety Inconsistency resolution XML-based “universal” translation Application-aware EAI adapters Data access middleware and ETLM Distributed query management Metadata management (Laney, 2001)
  • 6. 6 Data sources Big Data Source: Hortonworks & Teradata/ Vesselhead.com
  • 7. 7 The Data Revolution Titel van de presentatie Source: https://www.youtube.com/watch?v=LrNlZ7-SMPk
  • 8. 8 Volume Big Data Source: EMC/ IDC 40-45% Yearly growth in data volume
  • 9. 9 Connected devices I Big Data Source: HBR/ BI Intelligence
  • 10. 10 Connected devices II Big Data Source: http://chipestimate.com Source: http://motherboard.vice.com
  • 12. 12Big Data Big Data Advanced Analytics Source: Gartner/ Data Science Central
  • 13. 13 Social, Mobile, Analytics & Cloud (SMAC) Big Data Source: Cognizant: Don’t get SMACked
  • 16. 16 A new approach I Big Data Source: Datasciencecentral.com
  • 17. 17 Business Intelligence vs. Data Science Big Data Source: EMC
  • 18. 18 Improving Return-on-Investment Big Data Source: http://www.threedeedigital.com/big-data-analytics-customer-acquisition-strategy/
  • 20. 20 Technical Requirements Big Data Analytics Platform Volume • More data = more information • Large scale data processing • Linear scalable • Broadband network • High I/O throughput Variety • Combining sources, -> (heavy) ETL • Multiple databases/ -stores • Modern Application platform • Expanded ‘toolkit’ • Modulair configurable • Rapid Platform Development • Storage options Velocity • Realtime data processing • High performance compute • Excellent network connectivity • Compute & Memory intensive
  • 22. 22 Challenge: Lots of applications Big Data Source: Datafloq
  • 23. 23 Challenge: Data Storage Big Data Datasilos Structured. Well organized, but incomplete. Datalakes “Put it all in Hadoop or some big NoSQL database” RicePaddies Structured & Unstructured data in different places. “Datalakes in silos”
  • 24. 24 Challenge: Streaming, The Apps, Data & Analytics cycle Big Data
  • 25. 25 Data (-Storage, -Streams, -Analytics) Capabilities Big Data Source: Rob Winters “ Billions of Rows, Millions of Insights Right now”
  • 26. 26 Challenges Overview Big Data Secure Data Governance Accessible Ease of use Data Driven Capturing new business Improving the business Architecture Capturing concerns & plan Removing “Barriers” Technology Solutions Volume, Velocity & Variety Adaptive Continuous change Rapid Development Tooling & Process Elastic scalable Application agnostic Adjustable resources Abstract complexity Lots of self-service Design for scalability Multi vendor + exit-plan Check & verify Automation is critical API based
  • 28. 28 Pets, Cattle & Chicken Big Data Pets: pussinboots Build to specs & Maintain Traditional Enterprise IT Cattle: node72 Deploy, Run, Add/Delete, & Update Largescale Data Processing Chicken: application[…] Containerized Apps Lightweight & Stateless Elastic scalable applications
  • 29. 29 Pets Big Data The traditional server Build to fulfil a particular task Failing systems get healed ASAP Single point(s)-of-failure Periodic downtime inevitable Typically managed manually (sometimes assisted by scripts) Domain of the sys-admin
  • 30. 30 Cattle Big Data Just another node in a network No single-point(s)-of-failure Rolling upgrades Downtime a thing from the past Failing systems get deleted Managed by automation Domain of the system (automation) engineers
  • 31. 31 Chicken Big Data Applications only Inherits characteristics from cattle Added abstraction Added efficiency However: N/A for the data- platform itself Domain of the Developer
  • 35. 35 Continuous (a.k.a. Rapid) Development Big Data Rebuild Playscripts Run Platform Improve
  • 37. 37 Lambda: a need for speed Big Data Source: YMC.ch
  • 38. 38 Big Data: Platform Layout(s) Big Data
  • 40. 40 Reliability over unreliable infrastructure Big Data Source: The Bosting Consulting Group Twisted pair Coaxial Fiber Spectrum Ethernet PPP CDMA IEEE 802 IP TCP UDP HTTP SMTP RTP Browser E-mail VOIP client Innovation Experimentation Personalization Scale Utilization “End-to-end Principle” “End-to-end Principle” ... ... ... ... ...
  • 43. 43Big Data (source: tomitspro.com) Storing Data NoSQL Traditional databases
  • 45. 45 RDBMS Big Data Relational DataBase Management System - Data is structured in database tables, fields and records. - Each table consists of database table rows. - Each database table row consists of one or more table fields.
  • 47. 47 Volume More data, less performance And the 3V’s? Big Data Velocity Ok when either Read (OLAP) or Write (OLTP) But not both Variety Nope, just strictly structured Keep (re-)modeling Other pros and cons of an RDBMS: ‘Normalization’ divides logically clustered data Very good querying with SQL Easy to understand and work with thanks to strict structures
  • 49. 49 Partitioning Big Data Partition (database) (From Wikipedia, the free encyclopedia) A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons.
  • 50. 50 Sharding Big Data Shard (database architecture) (From Wikipedia, the free encyclopedia) A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance, to spread load.
  • 52. 52 key-value(NoSQL) Big Data Collection of key/value pairs where key is unique identifier, and value is an arbitrary piece of data. (Source: smalsrech.be)
  • 53. 53 Volume Easy sharding as everything is about the key And the 3V’s? Big Data Velocity Updating not easy as all data for a single key are usually overwritten Variety You can go nuts, data is just a ‘blob’ It is all up to the user Other pros and cons of a key-value NoSQL solution: Ideal for short-lived data Often support for auto TTL (Time-To-Live) Very fast as most data (only) lives in memory No ‘querying’ or searching, just keys
  • 54. 54 document(NoSQL) Big Data Similar to key/value store, only values are documents with implicit schematic structure. (Source: smalsrech.be)
  • 55. 55 And the 3V’s? Big Data Volume: being key based makes sharding easier. Velocity: updating is possible (documents have a structure) but not easy. Variety: you could go nuts, but querying expects common elements in structures. Other pros and cons of a NoSQL document store: freedom of structure for documents; support for versioning of documents; query performance really depends on (lack of) Variety.
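To illustrate why querying a document store expects common elements, here is a small sketch with plain dictionaries as documents. The `find` helper and the sample documents are invented for this example; real document stores use indexes rather than a full scan.

```python
# Hypothetical collection: document id -> document (a free-form dict).
docs = {
    "d1": {"type": "song", "artist": "Shaggy", "tempo": 99.994},
    "d2": {"type": "song", "artist": "Rick Astley"},
    "d3": {"name": "sensor-7", "reading": 21.5},
}

def find(collection, **criteria):
    """Return the sorted ids of documents whose fields match all criteria."""
    return sorted(
        doc_id
        for doc_id, doc in collection.items()
        if all(field in doc and doc[field] == value
               for field, value in criteria.items())
    )

songs = find(docs, type="song")        # matches d1 and d2
by_tempo = find(docs, tempo=99.994)    # only d1: d2 has no "tempo" field
```

Documents that lack the queried field are silently skipped, so the more the structures diverge, the less a single query can say about the whole collection.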
  • 56. 56 column(NoSQL) Big Data Associates keys with sets (families) of columns that provide structure to the model for optimal distribution of data. (Source: smalsrech.be)
  • 57. 57 And the 3V’s? Big Data Volume: column families help sharding (partitioning). Velocity: updating is possible thanks to families. Variety: some freedom within a column family, but the overall structure is fixed. Other pros and cons of a NoSQL column store: column families can be used to exploit data locality; complexity of designs.
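A rough sketch of the column-family model described above: the set of families is fixed when the table is created, while the columns inside a family are free. The class `ColumnFamilyTable` and its methods are hypothetical and only mimic the data model, not the storage layout of a real column store.

```python
class ColumnFamilyTable:
    """Row key -> column family -> column -> value, with fixed families."""

    def __init__(self, families):
        self._families = frozenset(families)   # overall structure is fixed
        self._rows = {}

    def put(self, row_key, family, column, value):
        """Update a single column without rewriting the rest of the row."""
        if family not in self._families:
            raise KeyError(f"unknown column family: {family}")
        self._rows.setdefault(row_key, {}).setdefault(family, {})[column] = value

    def get_family(self, row_key, family):
        """Return a copy of all columns stored in one family of one row."""
        return dict(self._rows.get(row_key, {}).get(family, {}))

table = ColumnFamilyTable(["profile", "stats"])
table.put("u1", "profile", "name", "Ann")
table.put("u1", "profile", "city", "Delft")   # new column, no schema change
table.put("u1", "stats", "logins", 3)
```

Reading one family returns only the columns grouped there, which is the locality a column store tries to exploit on disk.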
  • 58. 58 graph(NoSQL) Big Data Emphasizes the relationships between items through a flexible ‘web’ as opposed to rigid structure. (Source: smalsrech.be)
  • 59. 59 And the 3V’s? Big Data Volume: just relations, so when/why/what goes in which shard? Velocity: easy, just reroute relations. Variety: you could go nuts, but maintaining meaningful relationships requires some thought. Other pros and cons of a NoSQL graph store: easy insight into adjacent ‘documents’; fits very well for linked data (social); not an easy concept.
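The graph-store points above (easy insight into adjacent items, updates by rerouting relations) can be sketched with an undirected adjacency map. The `Graph` class and the sample names are assumptions made for this illustration only.

```python
from collections import defaultdict

class Graph:
    """Undirected graph stored as node -> set of neighbouring nodes."""

    def __init__(self):
        self._edges = defaultdict(set)

    def relate(self, a, b):
        """Add a relation between two nodes."""
        self._edges[a].add(b)
        self._edges[b].add(a)

    def unrelate(self, a, b):
        """Remove a relation if it exists."""
        self._edges[a].discard(b)
        self._edges[b].discard(a)

    def neighbours(self, node):
        """Return a copy of the nodes directly related to this node."""
        return set(self._edges.get(node, ()))

    def friends_of_friends(self, node):
        """Nodes two hops away that are not already direct neighbours."""
        direct = self.neighbours(node)
        two_hops = set()
        for neighbour in direct:
            two_hops |= self.neighbours(neighbour)
        return two_hops - direct - {node}

g = Graph()
g.relate("ann", "bob")
g.relate("bob", "cat")
g.relate("cat", "dan")
```

Walking to adjacent nodes is a cheap set lookup here. Deciding which nodes belong on which shard is the hard part, since any relation may cross a shard boundary.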
  • 61. 61 Big Data Elasticsearch: built on Apache Lucene (the search library that also underlies Apache Solr); primarily focused on text searching (using Lucene), but also a JSON document store, with automatic indexing based on document structure and good querying options, including proximity and ranges.
  • 62. 62 And the 3V’s? Big Data Volume: predefined indexes work well for sharding. Velocity: not fast, but OK thanks to JSON structures; not intended for Velocity. Variety: some, as long as documents are JSON; predefined indexes. Other pros and cons of Elasticsearch: getting the expected text-search result is not as easy as it looks; support for versioning and TTL; very powerful for human-language-related operations.
  • 63. 63 So what should we use? Big Data
  • 66. 66 It is all about knowing your (big) data. Big Data
  • 68. 68 Platform Approach Big Data Traditional (specification driven): Business users: seek a functional solution for a particular “job”; formulate the question. IT (IT solution): defining requirements; technical feasibility; translation to technical design; build process; integrate. Cloud (functionality driven): IT (IT platform): build for specific “jobs”; value-driven; rich in functionality. Business users: decide to use particular functionality, or not; explore its uses; subscription based. Ideal for (Big) Data Analytics!
  • 69. 69 Platform perspective Big Data ElasticCloud Big Data Analytics Hadoop, Spark, NoSQL Connectivity resource Compute resource Storage resource Backup resource Deployment service Monitoring service Security service Services Capacity resources Platform ... Cloudstudio Hosting & Development
  • 70. 70 Why Elastic Cloud? Strategy Outlook 2015 – 2017. Challenges: Development: short cycles (create, deploy, delete). Customer: expectation is instant; functionality driven, standardised and simplified (images, recipes and services). Elasticity: Amazon-like cloud computing (“ability to adapt to workload changes by provisioning and de-provisioning resources in an autonomic manner”).
  • 71. 71 Big Data Analytics Platform: Node layout Big Data
  • 76. 76 Millionsongs Hadoop Fundamentals
      A = LOAD '/datasets/millionsongs-subset.csv' using PigStorage('#');
      B = FOREACH A GENERATE $0 as artist, $1 as title, $8 as location, $49 as tempo;
      C = FILTER B BY NOT(tempo matches '.*,.*');
      D = ORDER C BY tempo ASC;
      dump D;
      (Gian Marco,Te Mentiría,Peru,99.983999999999995)
      (Rick Astley,Nature Boy,Newton-le-Willows, Merseyside, England,99.984999999999999)
      (Joe McBride,All In,,99.989999999999995)
      (Shaggy,Criteria,Kingston, Jamaica,99.994)

      E = GROUP C BY location;
      F = FOREACH E GENERATE group, COUNT(C) as count, $1;
      G = FILTER F BY count > 10;
      H = FOREACH G GENERATE group, AVG(C.tempo) as result;
      I = ORDER H BY result ASC;
      dump I;
      (Boston, MA,135.11847619047617)
      (Louisiana,135.23941666666667)
      (Houston, TX,136.4752)
      (Buenos Aires, Argentina,137.97354545454547)
      (Gainesville, FL,144.8068235294118)

      J = FILTER C BY (location matches '.*Gainesville.*');
      H = ORDER J BY artist ASC;
      dump H;
      (Less Than Jake,Short On Ideas / One Last Cigarette,Gainesville, FL,170.41200000000001)
      (Sister Hazel,Come Around (Acoustic),Gainesville, FL,195.63300000000001)
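For readers unfamiliar with Pig, the middle part of the script (group by location, keep groups above a minimum size, average the tempo, sort ascending) can be mirrored in plain Python. This is only a single-machine sketch: the function name `average_tempo_by_location` and the three sample rows are invented for illustration and are not taken from the Million Song dataset.

```python
from collections import defaultdict

def average_tempo_by_location(rows, min_count):
    """Mirror of GROUP / COUNT / FILTER / AVG / ORDER in the Pig script.

    rows: iterable of (artist, title, location, tempo) tuples.
    Only locations with more than min_count songs are kept.
    """
    tempos_by_location = defaultdict(list)
    for _artist, _title, location, tempo in rows:
        tempos_by_location[location].append(tempo)

    averages = [
        (location, sum(tempos) / len(tempos))
        for location, tempos in tempos_by_location.items()
        if len(tempos) > min_count
    ]
    return sorted(averages, key=lambda pair: pair[1])

# Made-up sample rows, for illustration only.
sample = [
    ("Artist A", "Song 1", "X", 100.0),
    ("Artist B", "Song 2", "X", 120.0),
    ("Artist C", "Song 3", "Y", 90.0),
]
result = average_tempo_by_location(sample, min_count=1)
```

The Pig version expresses the same steps declaratively, which lets the platform spread the grouping and averaging over many nodes instead of one Python process.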
  • 79. Improve continuously Add Functionality “Building Platforms & Components” Help wanted! The road ahead