SlideShare a Scribd company logo
Big Data
Big Data: an introduction
Dr. ir. ing. Bart Vandewoestyne
Sizing Servers Lab, Howest, Kortrijk
March 28, 2014
1 / 51
Big Data
Outline
1 Introduction: Big Data?
2 Big Data Technology
3 Big Data in my company?
4 IWT TETRA project
5 Conclusions
2 / 51
Big Data
Introduction: Big Data?
Outline
1 Introduction: Big Data?
2 Big Data Technology
3 Big Data in my company?
4 IWT TETRA project
5 Conclusions
3 / 51
Big Data
Introduction: Big Data?
Exponential growth of data
© 2013 International Business Machines Corporation 4
Big Data: This is just the beginning
2010
VolumeinExabytes
9000
8000
7000
6000
5000
4000
3000
2015
Percentage of uncertain data
Percentofuncertaindata
100
80
60
40
20
0
You are here
Sensors
& Devices
VoIP
Enterprise
Data
Social
Media
4 / 51
Big Data
Introduction: Big Data?
Big Data definition
Definition of Big Data depends on who you ask:
Big Data
“Multiple terabytes or petabytes.”
(according to some professionals)
“I don’t know.”
(today’s big may be tomorrow’s normal)
“Relative to its context.”
5 / 51
Big Data
Introduction: Big Data?
Quotes on Big Data
“Big data” is a subjective label attached to situations in
which human and technical infrastructures are unable to
keep pace with a company’s data needs.
It’s about recognizing that for some problems other
storage solutions are better suited.
6 / 51
Big Data
Introduction: Big Data?
The Three V’s
Volume The amount of data is big.
Variety Different kinds of data:
structured
semi-structured
unstructured
Velocity Speed-issues to consider:
How fast is the data available for analysis?
How fast can we do something with it?
Other V’s: Veracity, Variability, Validity, Value,. . .
7 / 51
Big Data
Introduction: Big Data?
Structured data
Structured data
Pre-defined schema imposed on the data
Highly structured
Usually stored in a relational database system
Example
numbers: 20, 3.1415,. . .
dates: 21/03/1978
strings: ”Hello World”
. . .
Roughly 20% of all data out there is structured.
8 / 51
Big Data
Introduction: Big Data?
Semi-structured data
Semi-structured data
Inconsistent structure.
Cannot be stored in rows and tables in a typical database.
Information is often self-describing (label/value pairs).
Example
XML, SGML,. . .
BibTeX files
logs
tweets
sensor feeds
. . .
9 / 51
Big Data
Introduction: Big Data?
Unstructured data
Definition (Unstructured data)
Lacks structure or parts of it lack structure.
Example
multimedia: videos, photos,
audio files,. . .
email messages
free-form text
word processing documents
presentations
reports
. . .
Experts estimate that 80 to 90 % of the data in any
organization is unstructured.
10 / 51
Big Data
Introduction: Big Data?
Data Storage and Analysis
Storage capacity of hard drives has increased massively over
the years.
Access speeds have not kept up.
Example (Reading a whole disk)
Year Storage Capacity Transfer Speed Time
1990 1370 MB 4.4 MB/s ≈ 5 minutes
2010 1 TB 100 MB/s > 2.5 hours
Solution: work in parallel!
Using 100 drives (each holding 1/100th of the data),
reading 1 TB takes less than 2 minutes.
11 / 51
Big Data
Introduction: Big Data?
Working in parallel
Problems
1 Hardware failure?
2 Combining data from different disks for analysis?
Solutions
1 HDFS: Hadoop Distributed Filesystem
2 MapReduce: programming model
12 / 51
Big Data
Big Data Technology
Outline
1 Introduction: Big Data?
2 Big Data Technology
3 Big Data in my company?
4 IWT TETRA project
5 Conclusions
13 / 51
Big Data
Big Data Technology
Big Data Landscape
14 / 51
Big Data
Big Data Technology
Hadoop
Hadoop is VMware, but the other way around.
15 / 51
Big Data
Big Data Technology
Hadoop as the opposite of a virtual machine
VMware
1 take one physical server
2 split it up
3 get many small virtual
servers
Hadoop
1 take many physical servers
2 merge them all together
3 get one big, massive, virtual
server
16 / 51
Big Data
Big Data Technology
Hadoop: core functionality
HDFS Self-healing, high-bandwidth, clustered storage.
MapReduce Distributed, fault-tolerant resource management,
coupled with scalable data processing.
17 / 51
Big Data
Big Data Technology
HDFS architecture
18 / 51
Big Data
Big Data Technology
MapReduce
19 / 51
Big Data
Big Data Technology
MapReduce
20 / 51
Big Data
Big Data Technology
Apache Hadoop essentials: technology stack
21 / 51
Big Data
Big Data Technology
Pig
MapReduce requires programmers
think in terms of map and reduce
functions,
more than likely use the Java language.
Pig provides a high-level language (Pig
Latin) that can be used by
Analysts
Data Scientists
Statisticians
Etc. . .
22 / 51
Big Data
Big Data Technology
Hive
Originated at Facebook to analyze log data.
HiveQL: Hive Query Language, similar to standard SQL.
Queries are compiled into MapReduce jobs.
Has command-line shell, similar to e.g. MySQL shell.
23 / 51
Big Data
Big Data Technology
Example Hadoop distributions
24 / 51
Big Data
Big Data Technology
NoSQL
25 / 51
Big Data
Big Data Technology
RDBMS: Codd’s 12 rules
Codd’s 12 rules
A set of rules designed to define what is required from a database
management system in order for it to be considered relational.
Rule 0 The Foundation rule
Rule 1 The Information rule
Rule 2 The guaranteed access rule
Rule 3 Systematic treatment of null values
Rule 4 Active online catalog based on the relational model
. . . . . .
26 / 51
Big Data
Big Data Technology
ACID
ACID
A set of properties that guarantee that database transactions are
processed reliably.
Atomicity A transaction is all or nothing.
Consistency Only transactions with valid data.
Isolation Simultaneous transactions will not interfere.
Durability Written transaction data stays there “forever”
(even in case of power loss, crashes, errors,. . . ).
27 / 51
Big Data
Big Data Technology
Scaling up
What if you need to scale up your RDBMS in terms of
dataset size,
read/write concurrency?
This usually involves
breaking Codds rules,
loosening ACID restrictions,
forgetting conventional DBA wisdom,
loose most of the desirable properties that made RDBMS so
convenient in the first place.
NoSQL to the rescue!
28 / 51
Big Data
Big Data Technology
NoSQL
NoSQL
‘Invented’ by Carl Strozzi in 1998 (for his file-based database)
“Not only SQL”
It’s NOT about
saying that SQL should never be used,
saying that SQL is dead.
29 / 51
Big Data
Big Data Technology
NoSQL databases
Four emerging NoSQL categories:
30 / 51
Big Data
Big Data Technology
Us the right tool for the right job!
http://db-engines.com/
31 / 51
Big Data
Big Data in my company?
Outline
1 Introduction: Big Data?
2 Big Data Technology
3 Big Data in my company?
4 IWT TETRA project
5 Conclusions
32 / 51
Big Data
Big Data in my company?
Typical RDBMS scaling story
1. Initial Public Launch
From local workstation → remotely hosted MySQL instance.
2. Service popularity ↑, too many reads hitting the database
Add memcached to cache common queries. Reads are now no
longer strictly ACID; cached data must expire.
3. Popularity ↑↑, too many writes hitting the database
Scale MySQL vertically by buying a beefed-up server:
16 cores
128 GB of RAM
banks of 15 k RPM hard drives



Costly
33 / 51
Big Data
Big Data in my company?
Typical RDBMS scaling story
4. New features → query complexity ↑, now too many joins
Denormalize your data to reduce joins.
(Thats not what they taught me in DBA school!)
5. Rising popularity swamps the server; things are too slow
Stop doing any server-side computations.
34 / 51
Big Data
Big Data in my company?
Typical RDBMS scaling story
6. Some queries are still too slow
Periodically prematerialize the most complex queries, and try to
stop joining in most cases.
7. Reads are OK, writes are getting slower and slower. . .
Drop secondary indexes and triggers (no indexes?).
If you stay up at night
worrying about your database
(uptime, scale, or speed), you
should seriously consider
making a jump from the
RDBMS world to HBase.
35 / 51
Big Data
Big Data in my company?
Use-cases of Big Data
‘Core Big Data’ company
Big Data
crunching,
hacking,
processing,
analyzing,
. . .
‘General Big Data’ company
Business Analytics
improve decision-making,
gain operational insights,
increase overall
performance,
track and analyze
shopping patterns,
. . .
Both
Explore! Discover hidden gems!
36 / 51
Big Data
Big Data in my company?
Some examples
Intrusion detection based on
server log data
Real-time security analytics
Fraud detection
Customer behavior based
sentiment analysis of social
media
Campaign analytics
37 / 51
Big Data
Big Data in my company?
Big Data in your company
38 / 51
Big Data
IWT TETRA project
Outline
1 Introduction: Big Data?
2 Big Data Technology
3 Big Data in my company?
4 IWT TETRA project
5 Conclusions
39 / 51
Big Data
IWT TETRA project
IWT TETRA project
Data mining: van relationele database naar Big Data.
Dates
Submitted: 12/03/2014
Notification of acceptance: July, 2014
Runs from 01/10/2014 – 01/10/2016
People involved
Wannes De Smet (researcher)
Bart Vandewoestyne (researcher)
Johan De Gelas (project coordinator)
Interested? → Come talk to us!
40 / 51
Big Data
IWT TETRA project
Project plan, work packages
RDBMS vs.
Distributed
Processing
Technology
Choice
MapReduce &
Alternatives
Big Data
Stack
Analysis
BI
Optimization
Distributed
Processing
Optimization Infrastructure
& Cloud
Analysis
Dissemination
41 / 51
Big Data
IWT TETRA project
WP1: RDBMS vs. Distributed Processing
Key question
When to switch from a ‘traditional’ technology to ‘Big Data’
technology?
Evaluate traditional database systems (Virtuoso, VoltDB,. . . )
Find their limitations.
Strengths? Weaknesses?
42 / 51
Big Data
IWT TETRA project
WP2: Analyse Big Data technology stack
Key idea
Get acquinted with Hadoop and its most important software
components.
Find best way to setup, administer and use Hadoop.
Get familiar with most important software components (Pig,
Hive, HBase,. . . ).
Find out how easy it is to integrate Hadoop into existing
architectures.
43 / 51
Big Data
IWT TETRA project
WP3: Alternatives for MapReduce
Key question
What are valuable alternatives for MapReduce?
Faster querying (compared to Pig & Hive)
Lightning-fast cluster computing
Distributed and fault-tolerant realtime computation
Apache Storm
44 / 51
Big Data
IWT TETRA project
WP4: BI optimization
Key questions
Where can existing BI solutions be optimized?
How can current BI solution interact with Big Data
technology?
Virtuoso, MS SQL
Server 2014, VoltDB,. . .
Apache Sqoop
45 / 51
Big Data
IWT TETRA project
WP5: Distributed Processing optimization
Key question
Where can Big Data technology be performance tuned?
How is the data stored?
Optimal settings for Hadoop, MapReduce,. . .
Benchmarks such as TestDFSIO, TeraSort, NNBench,
MRBench,. . .
46 / 51
Big Data
IWT TETRA project
WP6: Infrastructure & Cloud analysis
Key question
What hardware best fits the (Big Data) needs?
Perform hardware monitoring.
Analyze cloud solutions.
Formulate best practices.
Give advice on hardware choice.
47 / 51
Big Data
IWT TETRA project
WP7: Dissemination & project follow-up
Key idea
Spread the message!
Document case-studies.
Prepare for education.
Presentations at events.
Blogs, articles,. . .
Workshops
48 / 51
Big Data
Conclusions
Outline
1 Introduction: Big Data?
2 Big Data Technology
3 Big Data in my company?
4 IWT TETRA project
5 Conclusions
49 / 51
Big Data
Conclusions
Conclusions
“Big” can be small too.
The Big Data landscape is huge.
The right tool for the right job!
We can help → advice, case studies
Your company can benefit from Big Data technology.
Be brave in your quest. . .
50 / 51
Big Data
Conclusions
Questions?
Questions?
johan@sizingservers.be
wannes@sizingservers.be
bart@sizingservers.be
51 / 51

More Related Content

What's hot

Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
Guido Schmutz
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Data Science London
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
Prashant Gupta
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
Srinath Perera
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
humerashaziya
 
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionHow One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
DataWorks Summit
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Simplilearn
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Global Business Solutions SME
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
Sivashankar Ganapathy
 
Unit 2
Unit 2Unit 2
Big data visualization
Big data visualizationBig data visualization
Big data visualization
Anurag Gupta
 
NoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and WhereNoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and Where
Eugene Hanikblum
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
Benjamin Bengfort
 
Introduction to HADOOP.pdf
Introduction to HADOOP.pdfIntroduction to HADOOP.pdf
Introduction to HADOOP.pdf
8840VinayShelke
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
James Serra
 
Big Data Testing Strategies
Big Data Testing StrategiesBig Data Testing Strategies
Big Data Testing Strategies
Knoldus Inc.
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
Khalid Imran
 
Understanding big data and data analytics big data
Understanding big data and data analytics big dataUnderstanding big data and data analytics big data
Understanding big data and data analytics big data
Seta Wicaksana
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
Leandro Totino Pereira
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
Databricks
 

What's hot (20)

Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?Big Data [sorry] & Data Science: What Does a Data Scientist Do?
Big Data [sorry] & Data Science: What Does a Data Scientist Do?
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Introduction to Big Data
Introduction to Big Data Introduction to Big Data
Introduction to Big Data
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 MillionHow One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
How One Company Offloaded Data Warehouse ETL To Hadoop and Saved $30 Million
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big Data - Applications and Technologies Overview
Big Data - Applications and Technologies OverviewBig Data - Applications and Technologies Overview
Big Data - Applications and Technologies Overview
 
Unit 2
Unit 2Unit 2
Unit 2
 
Big data visualization
Big data visualizationBig data visualization
Big data visualization
 
NoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and WhereNoSQL Graph Databases - Why, When and Where
NoSQL Graph Databases - Why, When and Where
 
A Primer on Entity Resolution
A Primer on Entity ResolutionA Primer on Entity Resolution
A Primer on Entity Resolution
 
Introduction to HADOOP.pdf
Introduction to HADOOP.pdfIntroduction to HADOOP.pdf
Introduction to HADOOP.pdf
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Big Data Testing Strategies
Big Data Testing StrategiesBig Data Testing Strategies
Big Data Testing Strategies
 
Big Data Technology Stack : Nutshell
Big Data Technology Stack : NutshellBig Data Technology Stack : Nutshell
Big Data Technology Stack : Nutshell
 
Understanding big data and data analytics big data
Understanding big data and data analytics big dataUnderstanding big data and data analytics big data
Understanding big data and data analytics big data
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
What’s New with Databricks Machine Learning
What’s New with Databricks Machine LearningWhat’s New with Databricks Machine Learning
What’s New with Databricks Machine Learning
 

Viewers also liked

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Karan Desai
 
Big data introduction - Big Data from a Consulting perspective - Sogeti
Big data introduction - Big Data from a Consulting perspective - SogetiBig data introduction - Big Data from a Consulting perspective - Sogeti
Big data introduction - Big Data from a Consulting perspective - Sogeti
Edzo Botjes
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
Richard Vidgen
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
Lars Marius Garshol
 
Eliminating the Problems of Exponential Data Growth, Forever
Eliminating the Problems of Exponential Data Growth, ForeverEliminating the Problems of Exponential Data Growth, Forever
Eliminating the Problems of Exponential Data Growth, Forever
spectralogic
 
Data Protection in Big Data world (EDW lighting talk)
Data Protection in Big Data world (EDW lighting talk)Data Protection in Big Data world (EDW lighting talk)
Data Protection in Big Data world (EDW lighting talk)Castlebridge Associates
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopEdureka!
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Haluan Irsad
 
Big data ppt
Big data pptBig data ppt
Big data ppt
IDBI Bank Ltd.
 
Big Data Processing in the Cloud: A Hydra/Sufia Experience
Big Data Processing in the Cloud: A Hydra/Sufia ExperienceBig Data Processing in the Cloud: A Hydra/Sufia Experience
Big Data Processing in the Cloud: A Hydra/Sufia Experience
rotated8
 
Sept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the CloudSept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the Cloud
National Information Standards Organization (NISO)
 
Big Data introduction - Café Numérique Bruxelles
Big Data introduction - Café Numérique BruxellesBig Data introduction - Café Numérique Bruxelles
Big Data introduction - Café Numérique Bruxelles
Eric Rodriguez (Hiring in Lex)
 
Big Data and Mobile Commerce - Privacy and Data Protection
Big Data and Mobile Commerce - Privacy and Data ProtectionBig Data and Mobile Commerce - Privacy and Data Protection
Big Data and Mobile Commerce - Privacy and Data ProtectionKenneth Ho
 
Taming Big Data with NoSQL
Taming Big Data with NoSQLTaming Big Data with NoSQL
Taming Big Data with NoSQL
Basho Technologies
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
Vipin Batra
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!
Ian Foster
 
Big data vccorp
Big data vccorpBig data vccorp
Big data vccorp
Tuan Hoang
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
Uwe Printz
 

Viewers also liked (20)

Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big data introduction - Big Data from a Consulting perspective - Sogeti
Big data introduction - Big Data from a Consulting perspective - SogetiBig data introduction - Big Data from a Consulting perspective - Sogeti
Big data introduction - Big Data from a Consulting perspective - Sogeti
 
Introduction to big data
Introduction to big dataIntroduction to big data
Introduction to big data
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
Eliminating the Problems of Exponential Data Growth, Forever
Eliminating the Problems of Exponential Data Growth, ForeverEliminating the Problems of Exponential Data Growth, Forever
Eliminating the Problems of Exponential Data Growth, Forever
 
Data Protection in Big Data world (EDW lighting talk)
Data Protection in Big Data world (EDW lighting talk)Data Protection in Big Data world (EDW lighting talk)
Data Protection in Big Data world (EDW lighting talk)
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
Whatisbigdataandwhylearnhadoop
WhatisbigdataandwhylearnhadoopWhatisbigdataandwhylearnhadoop
Whatisbigdataandwhylearnhadoop
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big data ppt
Big data pptBig data ppt
Big data ppt
 
Big Data Processing in the Cloud: A Hydra/Sufia Experience
Big Data Processing in the Cloud: A Hydra/Sufia ExperienceBig Data Processing in the Cloud: A Hydra/Sufia Experience
Big Data Processing in the Cloud: A Hydra/Sufia Experience
 
Sept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the CloudSept 24 NISO Virtual Conference: Library Data in the Cloud
Sept 24 NISO Virtual Conference: Library Data in the Cloud
 
Big Data introduction - Café Numérique Bruxelles
Big Data introduction - Café Numérique BruxellesBig Data introduction - Café Numérique Bruxelles
Big Data introduction - Café Numérique Bruxelles
 
Big Data and Mobile Commerce - Privacy and Data Protection
Big Data and Mobile Commerce - Privacy and Data ProtectionBig Data and Mobile Commerce - Privacy and Data Protection
Big Data and Mobile Commerce - Privacy and Data Protection
 
Taming Big Data with NoSQL
Taming Big Data with NoSQLTaming Big Data with NoSQL
Taming Big Data with NoSQL
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Taming Big Data!
Taming Big Data!Taming Big Data!
Taming Big Data!
 
Big data vccorp
Big data vccorpBig data vccorp
Big data vccorp
 
Hadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduceHadoop 2 - Going beyond MapReduce
Hadoop 2 - Going beyond MapReduce
 

Similar to Big Data: an introduction

Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
Bart Vandewoestyne
 
Big Data
Big DataBig Data
Big Data
NGDATA
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
Bart Vandewoestyne
 
Introduction to Big Data An analogy between Sugar Cane & Big Data
Introduction to Big Data An analogy  between Sugar Cane & Big DataIntroduction to Big Data An analogy  between Sugar Cane & Big Data
Introduction to Big Data An analogy between Sugar Cane & Big DataJean-Marc Desvaux
 
Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big Data
DataWorks Summit
 
Big data management
Big data managementBig data management
Big data management
zeba khanam
 
Big data presentation (2014)
Big data presentation (2014)Big data presentation (2014)
Big data presentation (2014)
Xavier Constant
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
Prof.Balakrishnan S
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
Mihai Criveti
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Denodo
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
Nagarjuna D.N
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
Denodo
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
Sreedhar Chowdam
 
Big data and you
Big data and you Big data and you
Big data and you
IBM
 
Exploring the Wider World of Big Data
Exploring the Wider World of Big DataExploring the Wider World of Big Data
Exploring the Wider World of Big Data
NetApp
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
DATAVERSITY
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
Rajesh Kumar
 
BIG DATA
BIG DATABIG DATA
BIG DATA
Shashank Shetty
 
Big data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You WantBig data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You Want
Stuart Miniman
 
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End UsersFrom Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
Denodo
 

Similar to Big Data: an introduction (20)

Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
Big Data
Big DataBig Data
Big Data
 
Big Data: hype or necessity?
Big Data: hype or necessity?Big Data: hype or necessity?
Big Data: hype or necessity?
 
Introduction to Big Data An analogy between Sugar Cane & Big Data
Introduction to Big Data An analogy  between Sugar Cane & Big DataIntroduction to Big Data An analogy  between Sugar Cane & Big Data
Introduction to Big Data An analogy between Sugar Cane & Big Data
 
Deutsche Telekom on Big Data
Deutsche Telekom on Big DataDeutsche Telekom on Big Data
Deutsche Telekom on Big Data
 
Big data management
Big data managementBig data management
Big data management
 
Big data presentation (2014)
Big data presentation (2014)Big data presentation (2014)
Big data presentation (2014)
 
Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19Big Data Driven Solutions to Combat Covid' 19
Big Data Driven Solutions to Combat Covid' 19
 
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
DevOps for Data Engineers - Automate Your Data Science Pipeline with Ansible,...
 
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
Logical Data Lakes: From Single Purpose to Multipurpose Data Lakes (APAC)
 
Introduction to Cloud computing and Big Data-Hadoop
Introduction to Cloud computing and  Big Data-HadoopIntroduction to Cloud computing and  Big Data-Hadoop
Introduction to Cloud computing and Big Data-Hadoop
 
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
DAMA & Denodo Webinar: Modernizing Data Architecture Using Data Virtualization
 
Big Data Analytics
Big Data AnalyticsBig Data Analytics
Big Data Analytics
 
Big data and you
Big data and you Big data and you
Big data and you
 
Exploring the Wider World of Big Data
Exploring the Wider World of Big DataExploring the Wider World of Big Data
Exploring the Wider World of Big Data
 
DataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data ArchitectureDataOps - The Foundation for Your Agile Data Architecture
DataOps - The Foundation for Your Agile Data Architecture
 
Big data data lake and beyond
Big data data lake and beyond Big data data lake and beyond
Big data data lake and beyond
 
BIG DATA
BIG DATABIG DATA
BIG DATA
 
Big data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You WantBig data? No. Big Decisions are What You Want
Big data? No. Big Decisions are What You Want
 
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End UsersFrom Single Purpose to Multi Purpose Data Lakes - Broadening End Users
From Single Purpose to Multi Purpose Data Lakes - Broadening End Users
 

Recently uploaded

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
AnirbanRoy608946
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
slg6lamcq
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
u86oixdj
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 

Recently uploaded (20)

一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptxData_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
Data_and_Analytics_Essentials_Architect_an_Analytics_Platform.pptx
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
一比一原版(Adelaide毕业证书)阿德莱德大学毕业证如何办理
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
原版制作(Deakin毕业证书)迪肯大学毕业证学位证一模一样
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 

Big Data: an introduction

  • 1. Big Data Big Data: an introduction Dr. ir. ing. Bart Vandewoestyne Sizing Servers Lab, Howest, Kortrijk March 28, 2014 1 / 51
  • 2. Big Data Outline 1 Introduction: Big Data? 2 Big Data Technology 3 Big Data in my company? 4 IWT TETRA project 5 Conclusions 2 / 51
  • 3. Big Data Introduction: Big Data? Outline 1 Introduction: Big Data? 2 Big Data Technology 3 Big Data in my company? 4 IWT TETRA project 5 Conclusions 3 / 51
  • 4. Big Data Introduction: Big Data? Exponential growth of data © 2013 International Business Machines Corporation 4 Big Data: This is just the beginning 2010 VolumeinExabytes 9000 8000 7000 6000 5000 4000 3000 2015 Percentage of uncertain data Percentofuncertaindata 100 80 60 40 20 0 You are here Sensors & Devices VoIP Enterprise Data Social Media 4 / 51
  • 5. Big Data Introduction: Big Data? Big Data definition Definition of Big Data depends on who you ask: Big Data “Multiple terabytes or petabytes.” (according to some professionals) “I don’t know.” (today’s big may be tomorrow’s normal) “Relative to its context.” 5 / 51
  • 6. Big Data Introduction: Big Data? Quotes on Big Data “Big data” is a subjective label attached to situations in which human and technical infrastructures are unable to keep pace with a company’s data needs. It’s about recognizing that for some problems other storage solutions are better suited. 6 / 51
  • 7. Big Data Introduction: Big Data? The Three V’s Volume The amount of data is big. Variety Different kinds of data: structured semi-structured unstructured Velocity Speed-issues to consider: How fast is the data available for analysis? How fast can we do something with it? Other V’s: Veracity, Variability, Validity, Value,. . . 7 / 51
  • 8. Big Data Introduction: Big Data? Structured data Structured data Pre-defined schema imposed on the data Highly structured Usually stored in a relational database system Example numbers: 20, 3.1415,. . . dates: 21/03/1978 strings: ”Hello World” . . . Roughly 20% of all data out there is structured. 8 / 51
  • 9. Big Data Introduction: Big Data? Semi-structured data Semi-structured data Inconsistent structure. Cannot be stored in rows and tables in a typical database. Information is often self-describing (label/value pairs). Example XML, SGML,. . . BibTeX files logs tweets sensor feeds . . . 9 / 51
  • 10. Big Data Introduction: Big Data? Unstructured data Definition (Unstructured data) Lacks structure or parts of it lack structure. Example multimedia: videos, photos, audio files,. . . email messages free-form text word processing documents presentations reports . . . Experts estimate that 80 to 90 % of the data in any organization is unstructured. 10 / 51
  • 11. Big Data Introduction: Big Data? Data Storage and Analysis Storage capacity of hard drives has increased massively over the years. Access speeds have not kept up. Example (Reading a whole disk) Year Storage Capacity Transfer Speed Time 1990 1370 MB 4.4 MB/s ≈ 5 minutes 2010 1 TB 100 MB/s > 2.5 hours Solution: work in parallel! Using 100 drives (each holding 1/100th of the data), reading 1 TB takes less than 2 minutes. 11 / 51
  • 12. Big Data Introduction: Big Data? Working in parallel Problems 1 Hardware failure? 2 Combining data from different disks for analysis? Solutions 1 HDFS: Hadoop Distributed Filesystem 2 MapReduce: programming model 12 / 51
  • 13. Big Data Big Data Technology Outline 1 Introduction: Big Data? 2 Big Data Technology 3 Big Data in my company? 4 IWT TETRA project 5 Conclusions 13 / 51
  • 14. Big Data Big Data Technology Big Data Landscape 14 / 51
  • 15. Big Data Big Data Technology Hadoop Hadoop is VMware, but the other way around. 15 / 51
  • 16. Big Data Big Data Technology Hadoop as the opposite of a virtual machine VMware 1 take one physical server 2 split it up 3 get many small virtual servers Hadoop 1 take many physical servers 2 merge them all together 3 get one big, massive, virtual server 16 / 51
  • 17. Big Data Big Data Technology Hadoop: core functionality HDFS Self-healing, high-bandwidth, clustered storage. MapReduce Distributed, fault-tolerant resource management, coupled with scalable data processing. 17 / 51
  • 18. Big Data Big Data Technology HDFS architecture 18 / 51
  • 19. Big Data Big Data Technology MapReduce 19 / 51
  • 20. Big Data Big Data Technology MapReduce 20 / 51
  • 21. Big Data Big Data Technology Apache Hadoop essentials: technology stack 21 / 51
  • 22. Big Data Big Data Technology Pig MapReduce requires programmers think in terms of map and reduce functions, more than likely use the Java language. Pig provides a high-level language (Pig Latin) that can be used by Analysts Data Scientists Statisticians Etc. . . 22 / 51
  • 23. Big Data Big Data Technology Hive Originated at Facebook to analyze log data. HiveQL: Hive Query Language, similar to standard SQL. Queries are compiled into MapReduce jobs. Has command-line shell, similar to e.g. MySQL shell. 23 / 51
  • 24. Big Data Big Data Technology Example Hadoop distributions 24 / 51
  • 25. Big Data Big Data Technology NoSQL 25 / 51
  • 26. Big Data Big Data Technology RDBMS: Codd’s 12 rules Codd’s 12 rules A set of rules designed to define what is required from a database management system in order for it to be considered relational. Rule 0 The Foundation rule Rule 1 The Information rule Rule 2 The guaranteed access rule Rule 3 Systematic treatment of null values Rule 4 Active online catalog based on the relational model . . . . . . 26 / 51
  • 27. Big Data Big Data Technology ACID ACID A set of properties that guarantee that database transactions are processed reliably. Atomicity A transaction is all or nothing. Consistency Only transactions with valid data. Isolation Simultaneous transactions will not interfere. Durability Written transaction data stays there “forever” (even in case of power loss, crashes, errors,. . . ). 27 / 51
  • 28. Big Data Big Data Technology Scaling up What if you need to scale up your RDBMS in terms of dataset size, read/write concurrency? This usually involves breaking Codds rules, loosening ACID restrictions, forgetting conventional DBA wisdom, loose most of the desirable properties that made RDBMS so convenient in the first place. NoSQL to the rescue! 28 / 51
  • 29. Big Data Big Data Technology NoSQL NoSQL ‘Invented’ by Carl Strozzi in 1998 (for his file-based database) “Not only SQL” It’s NOT about saying that SQL should never be used, saying that SQL is dead. 29 / 51
  • 30. Big Data Big Data Technology NoSQL databases Four emerging NoSQL categories: 30 / 51
  • 31. Big Data Big Data Technology Us the right tool for the right job! http://db-engines.com/ 31 / 51
  • 32. Big Data Big Data in my company? Outline 1 Introduction: Big Data? 2 Big Data Technology 3 Big Data in my company? 4 IWT TETRA project 5 Conclusions 32 / 51
  • 33. Big Data Big Data in my company? Typical RDBMS scaling story 1. Initial Public Launch From local workstation → remotely hosted MySQL instance. 2. Service popularity ↑, too many reads hitting the database Add memcached to cache common queries. Reads are now no longer strictly ACID; cached data must expire. 3. Popularity ↑↑, too many writes hitting the database Scale MySQL vertically by buying a beefed-up server: 16 cores 128 GB of RAM banks of 15 k RPM hard drives    Costly 33 / 51
  • 34. Big Data Big Data in my company? Typical RDBMS scaling story 4. New features → query complexity ↑, now too many joins Denormalize your data to reduce joins. (Thats not what they taught me in DBA school!) 5. Rising popularity swamps the server; things are too slow Stop doing any server-side computations. 34 / 51
  • 35. Big Data Big Data in my company? Typical RDBMS scaling story 6. Some queries are still too slow Periodically prematerialize the most complex queries, and try to stop joining in most cases. 7. Reads are OK, writes are getting slower and slower. . . Drop secondary indexes and triggers (no indexes?). If you stay up at night worrying about your database (uptime, scale, or speed), you should seriously consider making a jump from the RDBMS world to HBase. 35 / 51
  • 36. Big Data Big Data in my company? Use-cases of Big Data ‘Core Big Data’ company Big Data crunching, hacking, processing, analyzing, . . . ‘General Big Data’ company Business Analytics improve decision-making, gain operational insights, increase overall performance, track and analyze shopping patterns, . . . Both Explore! Discover hidden gems! 36 / 51
  • 37. Big Data Big Data in my company? Some examples Intrusion detection based on server log data Real-time security analytics Fraud detection Customer behavior based sentiment analysis of social media Campaign analytics 37 / 51
  • 38. Big Data Big Data in my company? Big Data in your company 38 / 51
  • 39. Big Data IWT TETRA project Outline 1 Introduction: Big Data? 2 Big Data Technology 3 Big Data in my company? 4 IWT TETRA project 5 Conclusions 39 / 51
  • 40. Big Data IWT TETRA project IWT TETRA project Data mining: van relationele database naar Big Data. Dates Submitted: 12/03/2014 Notification of acceptance: July, 2014 Runs from 01/10/2014 – 01/10/2016 People involved Wannes De Smet (researcher) Bart Vandewoestyne (researcher) Johan De Gelas (project coordinator) Interested? → Come talk to us! 40 / 51
  • 41. Big Data IWT TETRA project Project plan, work packages RDBMS vs. Distributed Processing Technology Choice MapReduce & Alternatives Big Data Stack Analysis BI Optimization Distributed Processing Optimization Infrastructure & Cloud Analysis Dissemination 41 / 51
  • 42. Big Data IWT TETRA project WP1: RDBMS vs. Distributed Processing Key question When to switch from a ‘traditional’ technology to ‘Big Data’ technology? Evaluate traditional database systems (Virtuoso, VoltDB,. . . ) Find their limitations. Strengths? Weaknesses? 42 / 51
  • 43. Big Data IWT TETRA project WP2: Analyse Big Data technology stack Key idea Get acquinted with Hadoop and its most important software components. Find best way to setup, administer and use Hadoop. Get familiar with most important software components (Pig, Hive, HBase,. . . ). Find out how easy it is to integrate Hadoop into existing architectures. 43 / 51
  • 44. Big Data IWT TETRA project WP3: Alternatives for MapReduce Key question What are valuable alternatives for MapReduce? Faster querying (compared to Pig & Hive) Lightning-fast cluster computing Distributed and fault-tolerant realtime computation Apache Storm 44 / 51
  • 45. Big Data IWT TETRA project WP4: BI optimization Key questions Where can existing BI solutions be optimized? How can current BI solution interact with Big Data technology? Virtuoso, MS SQL Server 2014, VoltDB,. . . Apache Sqoop 45 / 51
  • 46. Big Data IWT TETRA project WP5: Distributed Processing optimization Key question Where can Big Data technology be performance tuned? How is the data stored? Optimal settings for Hadoop, MapReduce,. . . Benchmarks such as TestDFSIO, TeraSort, NNBench, MRBench,. . . 46 / 51
  • 47. Big Data IWT TETRA project WP6: Infrastructure & Cloud analysis Key question What hardware best fits the (Big Data) needs? Perform hardware monitoring. Analyze cloud solutions. Formulate best practices. Give advice on hardware choice. 47 / 51
  • 48. Big Data IWT TETRA project WP7: Dissemination & project follow-up Key idea Spread the message! Document case-studies. Prepare for education. Presentations at events. Blogs, articles,. . . Workshops 48 / 51
  • 49. Big Data Conclusions Outline 1 Introduction: Big Data? 2 Big Data Technology 3 Big Data in my company? 4 IWT TETRA project 5 Conclusions 49 / 51
  • 50. Big Data Conclusions Conclusions “Big” can be small too. The Big Data landscape is huge. The right tool for the right job! We can help → advice, case studies Your company can benefit from Big Data technology. Be brave in your quest. . . 50 / 51