SlideShare a Scribd company logo
Headline Goes Here
Speaker Name or Subhead Goes Here
Hadoop & Complex (Systems) Research

Thoughts About Large Scale Data Management
GridKa School 2014, Karlsruhe, Germany | 2014-09-02	

Mirko Kämpf | Cloudera
AGENDA
1) Complex Systems Research and Big Data: Challenges
2) The role of Metadata in the Big Data world	

3) Data Management in Hadoop	

4) Cluster Federation using the ETOSHA data catalog
2
COLLECT SOME INFORMATION…
• Do you work interdisciplinary?	

• Do you collect your own data?	

• Do you share data sets often?	

• Do you contribute sub sets of results?	

• Do you integrate multiple data sets?
1: CHALLENGES
• As research projects become more complex, larger groups
need scalable collaborative technology and methods.	

• As soon as datasets become to large it is impossible for them
to be copied between partners. We need data federation
for public and private (anonymous) data.	

• Research results have to be transparent and reproducible. 

This requires not just access to research papers, but also 

access to comprehensive contextual information.
4
5
Source: http://www.idiagram.com/examples/characterisitcs.gif
across types of systems,

across scales, and thus 

across disciplines
many components
dynamic
interactions
giving rise to a
number of"
hierarchical levels
which exhibit
common behaviors
Emergent behavior that connote"
be simply inferred from the behavior "
of the components
CHARACTERISTICS OF BIG DATA
6
Source: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg
IS IT A BIG DATA PROBLEM?
• Share and integrate measured data and simulation results from
multiple sub projects on different scales (space and time) into
one large meta model. 	

• Volume: For better statistical results large datasets are required. 	

• Velocity: Becomes more important in real world real time applications.	

• Variety: New methods lead to evolution of procedures and schema updates.	

• Veracity: Data collection tools using novel methods may not be perfect.	

7
2:THE ROLE OF METADATA
• What can be expected from a data set or: “What is in it?”
• Data set description is provided by metadata.	

• How can I get access to a particular dataset? 	

• legacy systems: FTP, NFS, and HTTP downloads	

• state of the art: cloud storage & distributed file systems (Amazon S3)	

• Location, version and schema is also metadata.
8
DEFNITION: DATA & METADATA
"
from: WIKIPEDIA!
"
Data: Data as an abstract concept can be viewed as the lowest level of
abstraction, 

from which information and then knowledge are derived.

Metadata: Metadata (metacontent) is defined as the data providing information
about 

one or more aspects of the data, such as: 

" - Location on a computer network where the data were created" "
" - Purpose of the data"
" - Time and date of creation"
" - Creator or author of the data"
" - Means of creation of the data"
" - Standards used …

9
SUMMARY"
according to Dr. J. Pomeranz in its Coursera course about metadata:"
"


(1) Metadata is a description about 

" - natural or artificial"
" - physical or digital things"
"
(2) Metadata can be: "
" - descriptive"
" - structural"
" - administrative"
"
(3) Metadata exists on a per item or per collection level"
"
(4) Metadata can be embedded or linked!
"
"
10
Metadata covers multiple aspects of a system.
PROCEDURAL & ENTITY
METADATA
• Entity Metadata is about a
document, a record, a file, a
table, a collection.	

• Procedural Metadata
describe the collection,
generation, filtering, and
transformation procedures.
11
RESEARCH APPROACH &
DATA REPRESENTATION
12
METADATA & CONTEXT
13
METADATA & CONTEXT
well managed 

in Hadoop
partially manageable

with Dublin Core
several monitoring tools
several project management tools
14
3: DATA MANAGEMENT IN HADOOP
• Hadoop is a distributed platform to store and process large data sets in parallel.

• Hadoop offers multiple processing engines:

- traditionally the Map-Reduce engine

- Machine learning with Mahout

- Graph processing with Giraph 

- Spark and GraphX for in memory processing

• Hadoop offers data management capabilities:

- HDFS for simple file storage 

- HBase for random access based on a row key

- SOLR to search in indexed datasets
15
HCATALOG / HIVE
• Define table structure for data stored in	

• simple HDFS files	

• HBase tables	

• Query such “Hive-Tables” also with Impala, Pig & MapReduce	

• Access data sets using the Kite-SDK and within Spark:	

"
" Source: https://gist.github.com/granturing/7201912
16
LIMITATIONS:
• Hive-Metastore contains primarily “technical” metadata.	

• Meaning of a data set is not managed.	

• Context and life cycle of a data set is not actively managed.	

• Processing parameters are not available in core Hadoop.	

• Hive-Metastore is shared, but only for users in one cluster.
17
4: WHAT IS ETOSHA?
• Etosha will be a distributed, decentralized, and fault tolerant
metadata service.	

• Its main purpose is publishing and linking of information 

about datasets. 	

• Etosha enables cluster federation for Hadoop. 	

• Etosha allows knowledge management while HCatalog and
the Hive-Metastore are focused on low-level technical
18
ETOSHA ARCHITECTURE
"
"
"
"
"
"
"
19
Hadoop cluster
once per Hadoop cluster
ETOSHA CONTEXT 

BROWSER
20
Context browser works across clusters.
NEXT STEPS
• Expose administrative, structural, and descriptive metadata

about Hadoop data sets via SPARQL endpoint.	

• Implement a data set discovery API to find relevant data.	

• Enable subquery delegation between Hadoop clusters.	

• Collect and publish metadata about public open datasets 

in the Etosha Data Catalog (EDC) project.
21
MANYTHANKS …	

• … to the organizers of this great event!	

• … to my friend and contributor EricTessenow.	

• … to my colleagues from Cloudera.	

• … and most importantly, to the members of the open source community!
22

More Related Content

What's hot

A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
Databricks
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
Spark Summit
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
inside-BigData.com
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
Ran Wei
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
Dorian Beganovic
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
Ayyappan Paramesh
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Sriram Krishnan
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
Asim Jalis
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
Databricks
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Databricks
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
Simplilearn
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
Akshansh Agarwal
 
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesmyHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
Sriram Krishnan
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
DataWorks Summit/Hadoop Summit
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
Dataflair Web Services Pvt Ltd
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
Edureka!
 
Opal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsOpal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific Applications
Sriram Krishnan
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
Shubham Parmar
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
DataWorks Summit/Hadoop Summit
 
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Edureka!
 

What's hot (20)

A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
 
Data Science at Scale by Sarah Guido
Data Science at Scale by Sarah GuidoData Science at Scale by Sarah Guido
Data Science at Scale by Sarah Guido
 
Practical introduction to hadoop
Practical introduction to hadoopPractical introduction to hadoop
Practical introduction to hadoop
 
Uber's data science workbench
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
 
Apache Spark in Industry
Apache Spark in IndustryApache Spark in Industry
Apache Spark in Industry
 
Big data Hadoop
Big data  Hadoop   Big data  Hadoop
Big data Hadoop
 
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at ScaleData Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
Data Platform at Twitter: Enabling Real-time & Batch Analytics at Scale
 
Data Engineering Quick Guide
Data Engineering Quick GuideData Engineering Quick Guide
Data Engineering Quick Guide
 
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining w...
 
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
Spark Summit San Francisco 2016 - Matei Zaharia Keynote: Apache Spark 2.0
 
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...
 
Big data & Hadoop
Big data & HadoopBig data & Hadoop
Big data & Hadoop
 
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC ResourcesmyHadoop - Hadoop-on-Demand on Traditional HPC Resources
myHadoop - Hadoop-on-Demand on Traditional HPC Resources
 
Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...Data infrastructure architecture for medium size organization: tips for colle...
Data infrastructure architecture for medium size organization: tips for colle...
 
Hadoop Tutorial For Beginners
Hadoop Tutorial For BeginnersHadoop Tutorial For Beginners
Hadoop Tutorial For Beginners
 
Performance of Spark vs MapReduce
Performance of Spark vs MapReducePerformance of Spark vs MapReduce
Performance of Spark vs MapReduce
 
Opal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific ApplicationsOpal: Simple Web Services Wrappers for Scientific Applications
Opal: Simple Web Services Wrappers for Scientific Applications
 
PPT on Hadoop
PPT on HadoopPPT on Hadoop
PPT on Hadoop
 
Hadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the ExpertsHadoop in the Cloud – The What, Why and How from the Experts
Hadoop in the Cloud – The What, Why and How from the Experts
 
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
Hadoop vs Spark | Which One to Choose? | Hadoop Training | Spark Training | E...
 

Viewers also liked

DevOps – what is it? Why? Is it real? How to do it?
DevOps – what is it? Why? Is it real? How to do it?DevOps – what is it? Why? Is it real? How to do it?
DevOps – what is it? Why? Is it real? How to do it?
Sailaja Tennati
 
«ПЕРВЫЙ ШАГ» - утешительные поединки
«ПЕРВЫЙ ШАГ» - утешительные поединки«ПЕРВЫЙ ШАГ» - утешительные поединки
SEO Training Workshop Presentation
SEO Training Workshop PresentationSEO Training Workshop Presentation
SEO Training Workshop Presentation
Tim Aldiss
 
R2i Marketo Case Studies
R2i Marketo Case StudiesR2i Marketo Case Studies
R2i Marketo Case Studies
R2integrated
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
Carol McDonald
 
Managing and monitoring large scale data transfers - Networkshop44
Managing and monitoring large scale data transfers - Networkshop44Managing and monitoring large scale data transfers - Networkshop44
Managing and monitoring large scale data transfers - Networkshop44
Jisc
 
Introduction to Networkshop - Networkshop44 2016
Introduction to Networkshop - Networkshop44 2016Introduction to Networkshop - Networkshop44 2016
Introduction to Networkshop - Networkshop44 2016
Jisc
 
Multiprotocol label switching (mpls) - Networkshop44
Multiprotocol label switching (mpls)  - Networkshop44Multiprotocol label switching (mpls)  - Networkshop44
Multiprotocol label switching (mpls) - Networkshop44
Jisc
 

Viewers also liked (8)

DevOps – what is it? Why? Is it real? How to do it?
DevOps – what is it? Why? Is it real? How to do it?DevOps – what is it? Why? Is it real? How to do it?
DevOps – what is it? Why? Is it real? How to do it?
 
«ПЕРВЫЙ ШАГ» - утешительные поединки
«ПЕРВЫЙ ШАГ» - утешительные поединки«ПЕРВЫЙ ШАГ» - утешительные поединки
«ПЕРВЫЙ ШАГ» - утешительные поединки
 
SEO Training Workshop Presentation
SEO Training Workshop PresentationSEO Training Workshop Presentation
SEO Training Workshop Presentation
 
R2i Marketo Case Studies
R2i Marketo Case StudiesR2i Marketo Case Studies
R2i Marketo Case Studies
 
Apache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision TreesApache Spark Machine Learning Decision Trees
Apache Spark Machine Learning Decision Trees
 
Managing and monitoring large scale data transfers - Networkshop44
Managing and monitoring large scale data transfers - Networkshop44Managing and monitoring large scale data transfers - Networkshop44
Managing and monitoring large scale data transfers - Networkshop44
 
Introduction to Networkshop - Networkshop44 2016
Introduction to Networkshop - Networkshop44 2016Introduction to Networkshop - Networkshop44 2016
Introduction to Networkshop - Networkshop44 2016
 
Multiprotocol label switching (mpls) - Networkshop44
Multiprotocol label switching (mpls)  - Networkshop44Multiprotocol label switching (mpls)  - Networkshop44
Multiprotocol label switching (mpls) - Networkshop44
 

Similar to Hadoop & Complex Systems Research

Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
Denodo
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
DignitasDigital1
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
maharajothip1
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
eakasit_dpu
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
nallagangus
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
Neeraja Rentachintala
 
Hadoop
HadoopHadoop
Hadoop
RittikaBaksi
 
Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentation
mlang222
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
Rajesh Jayarman
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
Philippe Julio
 
Big data applications
Big data applicationsBig data applications
Big data applications
Juan Pablo Paz Grau, Ph.D., PMP
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
Bhupesh Bansal
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
Marin Dimitrov
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
Jim Dowling
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
praveen bhat
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
MaharajothiP
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
Dr. Mirko Kämpf
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
EUCLID project
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
ShreyasKv13
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
VishalBH1
 

Similar to Hadoop & Complex Systems Research (20)

Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
 
Overview of Big Data by Sunny
Overview of Big Data by SunnyOverview of Big Data by Sunny
Overview of Big Data by Sunny
 
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for womenHadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
Hadoop Maharajathi,II-M.sc.,Computer Science,Bonsecours college for women
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Hadoop - Architectural road map for Hadoop Ecosystem
Hadoop -  Architectural road map for Hadoop EcosystemHadoop -  Architectural road map for Hadoop Ecosystem
Hadoop - Architectural road map for Hadoop Ecosystem
 
Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014Apache Drill at ApacheCon2014
Apache Drill at ApacheCon2014
 
Hadoop
HadoopHadoop
Hadoop
 
Teradata Loom Introductory Presentation
Teradata Loom Introductory PresentationTeradata Loom Introductory Presentation
Teradata Loom Introductory Presentation
 
Big Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RKBig Data Practice_Planning_steps_RK
Big Data Practice_Planning_steps_RK
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
Big data applications
Big data applicationsBig data applications
Big data applications
 
Large scale computing
Large scale computing Large scale computing
Large scale computing
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Hadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFSHadoop/MapReduce/HDFS
Hadoop/MapReduce/HDFS
 
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
P.Maharajothi,II-M.sc(computer science),Bon secours college for women,thanjavur.
 
Etosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road mapEtosha - Data Asset Manager : Status and road map
Etosha - Data Asset Manager : Status and road map
 
Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Module-2_HADOOP.pptx
Module-2_HADOOP.pptxModule-2_HADOOP.pptx
Module-2_HADOOP.pptx
 
BIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptxBIg Data Analytics-Module-2 vtu engineering.pptx
BIg Data Analytics-Module-2 vtu engineering.pptx
 

More from Dr. Mirko Kämpf

Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
Dr. Mirko Kämpf
 
IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the Clouds
Dr. Mirko Kämpf
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)
Dr. Mirko Kämpf
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentation
Dr. Mirko Kämpf
 
Enterprise Metadata Integration
Enterprise Metadata IntegrationEnterprise Metadata Integration
Enterprise Metadata Integration
Dr. Mirko Kämpf
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
Dr. Mirko Kämpf
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on Scale
Dr. Mirko Kämpf
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
Dr. Mirko Kämpf
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4
Dr. Mirko Kämpf
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation Optimization
Dr. Mirko Kämpf
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
Dr. Mirko Kämpf
 

More from Dr. Mirko Kämpf (11)

Time Series Analysis Using an Event Streaming Platform
 Time Series Analysis Using an Event Streaming Platform Time Series Analysis Using an Event Streaming Platform
Time Series Analysis Using an Event Streaming Platform
 
IoT meets AI in the Clouds
IoT meets AI in the CloudsIoT meets AI in the Clouds
IoT meets AI in the Clouds
 
Improving computer vision models at scale (Strata Data NYC)
Improving computer vision models at scale  (Strata Data NYC)Improving computer vision models at scale  (Strata Data NYC)
Improving computer vision models at scale (Strata Data NYC)
 
Improving computer vision models at scale presentation
Improving computer vision models at scale presentationImproving computer vision models at scale presentation
Improving computer vision models at scale presentation
 
Enterprise Metadata Integration
Enterprise Metadata IntegrationEnterprise Metadata Integration
Enterprise Metadata Integration
 
PCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System TuningPCAP Graphs for Cybersecurity and System Tuning
PCAP Graphs for Cybersecurity and System Tuning
 
From Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on ScaleFrom Events to Networks: Time Series Analysis on Scale
From Events to Networks: Time Series Analysis on Scale
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4DPG Berlin - SOE 18 - talk v1.2.4
DPG Berlin - SOE 18 - talk v1.2.4
 
Information Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation OptimizationInformation Spread in the Context of Evacuation Optimization
Information Spread in the Context of Evacuation Optimization
 
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
DPG 2014: "Context Sensitive and Time Dependent Relevance of Wikipedia Articles"
 

Recently uploaded

High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENTHigh Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
ranjeet3341
 
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
nitachopra
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
Vineet
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
9gr6pty
 
Health care analysis using sentimental analysis
Health care analysis using sentimental analysisHealth care analysis using sentimental analysis
Health care analysis using sentimental analysis
krishnasrigannavarap
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
blueshagoo1
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
perranet1
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
Vietnam Cotton & Spinning Association
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
agdhot
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
nhutnguyen355078
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
Vineet
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
oaxefes
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
Timothy Spann
 
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
sapna sharmap11
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 

Recently uploaded (20)

High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENTHigh Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
High Profile Call Girls Navi Mumbai ✅ 9833363713 FULL CASH PAYMENT
 
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
Call Girls Goa👉9024918724👉Low Rate Escorts in Goa 💃 Available 24/7
 
Senior Engineering Sample EM DOE - Sheet1.pdf
Senior Engineering Sample EM DOE  - Sheet1.pdfSenior Engineering Sample EM DOE  - Sheet1.pdf
Senior Engineering Sample EM DOE - Sheet1.pdf
 
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
一比一原版(uob毕业证书)伯明翰大学毕业证如何办理
 
Health care analysis using sentimental analysis
Health care analysis using sentimental analysisHealth care analysis using sentimental analysis
Health care analysis using sentimental analysis
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
Econ3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdfEcon3060_Screen Time and Success_ final_GroupProject.pdf
Econ3060_Screen Time and Success_ final_GroupProject.pdf
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics March 2024
 
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
一比一原版加拿大麦吉尔大学毕业证(mcgill毕业证书)如何办理
 
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdfOverview IFM June 2024 Consumer Confidence INDEX Report.pdf
Overview IFM June 2024 Consumer Confidence INDEX Report.pdf
 
Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
Data Scientist Machine Learning Profiles .pdf
Data Scientist Machine Learning  Profiles .pdfData Scientist Machine Learning  Profiles .pdf
Data Scientist Machine Learning Profiles .pdf
 
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
一比一原版卡尔加里大学毕业证(uc毕业证)如何办理
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
 
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
Call Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call GirlCall Girls Hyderabad  (india) ☎️ +91-7426014248 Hyderabad  Call Girl
Call Girls Hyderabad (india) ☎️ +91-7426014248 Hyderabad Call Girl
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 

Hadoop & Complex Systems Research

  • 1. Headline Goes Here Speaker Name or Subhead Goes Here Hadoop & Complex (Systems) Research
 Thoughts About Large Scale Data Management GridKa School 2014, Karlsruhe, Germany | 2014-09-02 Mirko Kämpf | Cloudera
  • 2. AGENDA 1) Complex Systems Research and Big Data: Challenges 2) The role of Metadata in the Big Data world 3) Data Management in Hadoop 4) Cluster Federation using the ETOSHA data catalog 2
  • 3. COLLECT SOME INFORMATION… • Do you work interdisciplinary? • Do you collect your own data? • Do you share data sets often? • Do you contribute sub sets of results? • Do you integrate multiple data sets?
  • 4. 1: CHALLENGES • As research projects become more complex, larger groups need scalable collaborative technology and methods. • As soon as datasets become to large it is impossible for them to be copied between partners. We need data federation for public and private (anonymous) data. • Research results have to be transparent and reproducible. 
 This requires not just access to research papers, but also 
 access to comprehensive contextual information. 4
  • 5. 5 Source: http://www.idiagram.com/examples/characterisitcs.gif across types of systems,
 across scales, and thus 
 across disciplines many components dynamic interactions giving rise to a number of" hierarchical levels which exhibit common behaviors Emergent behavior that connote" be simply inferred from the behavior " of the components
  • 6. CHARACTERISTICS OF BIG DATA 6 Source: http://api.ning.com/files/tRHkwQN7s-Xz5cxylXG004GLGJdjoPd6bVfVBwvgu*F5MwDDUCiHHdmBW-JTEz0cfJjGurJucBMTkIUNdL3jcZT8IPfNWfN9/dv1.jpg
  • 7. IS IT A BIG DATA PROBLEM? • Share and integrate measured data and simulation results from multiple sub projects on different scales (space and time) into one large meta model. • Volume: For better statistical results large datasets are required. • Velocity: Becomes more important in real world real time applications. • Variety: New methods lead to evolution of procedures and schema updates. • Veracity: Data collection tools using novel methods may not be perfect. 7
  • 8. 2:THE ROLE OF METADATA • What can be expected from a data set or: “What is in it?” • Data set description is provided by metadata. • How can I get access to a particular dataset? • legacy systems: FTP, NFS, and HTTP downloads • state of the art: cloud storage & distributed file systems (Amazon S3) • Location, version and schema is also metadata. 8
  • 9. DEFNITION: DATA & METADATA " from: WIKIPEDIA! " Data: Data as an abstract concept can be viewed as the lowest level of abstraction, 
 from which information and then knowledge are derived.
 Metadata: Metadata (metacontent) is defined as the data providing information about 
 one or more aspects of the data, such as: 
 " - Location on a computer network where the data were created" " " - Purpose of the data" " - Time and date of creation" " - Creator or author of the data" " - Means of creation of the data" " - Standards used …
 9
  • 10. SUMMARY" according to Dr. J. Pomeranz in its Coursera course about metadata:" " 
 (1) Metadata is a description about 
 " - natural or artificial" " - physical or digital things" " (2) Metadata can be: " " - descriptive" " - structural" " - administrative" " (3) Metadata exists on a per item or per collection level" " (4) Metadata can be embedded or linked! " " 10 Metadata covers multiple aspects of a system.
  • 11. PROCEDURAL & ENTITY METADATA • Entity Metadata is about a document, a record, a file, a table, a collection. • Procedural Metadata describe the collection, generation, filtering, and transformation procedures. 11
  • 12. RESEARCH APPROACH & DATA REPRESENTATION 12
  • 14. METADATA & CONTEXT well managed 
 in Hadoop partially manageable
 with Dublin Core several monitoring tools several project management tools 14
  • 15. 3: DATA MANAGEMENT IN HADOOP • Hadoop is a distributed platform to store and process large data sets in parallel.
 • Hadoop offers multiple processing engines:
 - traditionally the Map-Reduce engine
 - Machine learning with Mahout
 - Graph processing with Giraph 
 - Spark and GraphX for in memory processing
 • Hadoop offers data management capabilities:
 - HDFS for simple file storage 
 - HBase for random access based on a row key
 - SOLR to search in indexed datasets 15
  • 16. HCATALOG / HIVE • Define table structure for data stored in • simple HDFS files • HBase tables • Query such “Hive-Tables” also with Impala, Pig & MapReduce • Access data sets using the Kite-SDK and within Spark: " " Source: https://gist.github.com/granturing/7201912 16
  • 17. LIMITATIONS: • Hive-Metastore contains primarily “technical” metadata. • Meaning of a data set is not managed. • Context and life cycle of a data set is not actively managed. • Processing parameters are not available in core Hadoop. • Hive-Metastore is shared, but only for users in one cluster. 17
  • 18. 4: WHAT IS ETOSHA? • Etosha will be a distributed, decentralized, and fault tolerant metadata service. • Its main purpose is publishing and linking of information 
 about datasets. • Etosha enables cluster federation for Hadoop. • Etosha allows knowledge management while HCatalog and the Hive-Metastore are focused on low-level technical 18
  • 20. ETOSHA CONTEXT 
 BROWSER 20 Context browser works across clusters.
  • 21. NEXT STEPS • Expose administrative, structural, and descriptive metadata
 about Hadoop data sets via SPARQL endpoint. • Implement a data set discovery API to find relevant data. • Enable subquery delegation between Hadoop clusters. • Collect and publish metadata about public open datasets 
 in the Etosha Data Catalog (EDC) project. 21
  • 22. MANYTHANKS … • … to the organizers of this great event! • … to my friend and contributor EricTessenow. • … to my colleagues from Cloudera. • … and most importantly, to the members of the open source community! 22