SlideShare a Scribd company logo
Alexander Hendorf
Databases for Data-Science
Alexander C. S. Hendorf
− Senior Consultant Information Technology
− Program Chair EuroPython, PyConDE & PyData Karlsruhe, 

EuroSciPy, PSF Managing Member
− PyData Frankfurt and PyData Südwest Organiser
− Program Committee Percona Live
− MongoDB Masters / MongoDB Certified DBA
− Speaker Europe & USA MongoDB World New York / San José,
PyCon Italy, CEBIT, BI Forum, IT-Tage FFM, PyData, PyParis,
PyCon UK, Budapest BI,….
ah@koenigsweg.com
@hendorf
Services
Data We guide our clients through development and implementation processes
of technologies and applications to analyse, evaluate and visualize
business data.
Strategy & Operations We advise SME, start-ups and public institutions on building efficient sales
structures, on process optimization and on business development.
Communication We provide sound communication strategies and creative campaigns to
communicate your content and messages authentically throughout all
channels.
Financial Service Technologies Our industry experts support clients in the financial sector in developing
powerful and compliant FinTech applications.
!
Let‘s get to know each other!
» What‘s your background?«
− Data scientist
− Database admin
− Curious Pythonista
− Consultant, decision maker (IT Executive, Consultant, Innovation
Manager, YouTube influencer) 
"
Let‘s get to know each other!
» What‘s your Experience in…«
− RDBSs
− none
− some
− a lot
− Hadoop et al.
− none
− some
− a lot
− NoSQL Systems 

without Hadoop et al.
− none
− some
− a lot
#
Let‘s get to know each other!
» What‘s are you looking for?«
− Integration for data science into existing ecosystems
− Learning about databases for data science projects
− General interest / curiosity
− Just killing time until PyFiorentina
− …?
Three Angles for Databases for Data Science
− Choosing a database for data science projects
− Evaluating an existing database for data science requirements
− How to integrate into an existing ecosystem
$
„How deeply do data science &
data base understand each other?.“
$
Ask Google:
Data Scientist? Database Admin?
%#&'
()*+
,
https://xkcd.com/1838/
What are the Benefits of a Database for Data Science Anyway?
− Common source of data
− Avoiding redundancy (e.g. files)
− Persistence
− Optimized for handling and accessing data for decades
− Scalability
− Staying very close to the data
A Quick Recap on Database History
− 1960s, navigational DBMS (disks & drums)
− 1970s, relational DBMS
− 1980s, on the desktop
− 1990s, object-oriented
− 2000s, NoSQL
Relational Databases
− Records are organised into tables
− Rows of these table are identified by unique keys
− Data spans multiple tables, linked via ids
− Data is ideally normalised
− Data can be denormalized for performance
− Transactions are ACID [Atomic, Consistent , Isolated, Durable]
https://xkcd.com/927/
Relational Databases Benefits
− Widely used and supported
− Normalized data
− Comprehensive querying via SQL language 

- though some differences between databases
− Well researched and optimized over decades
Relational Databases Downsides
− Schemas are fixed and have to be pre-defined upfront (schema-first)
− Altering schema is not trivial
− Joining tables, depending on complexity, data volume may be costly,

also consider overhead understanding a schema with many tables
− Difficult to scale out
− Few data structures (tables, rows)
NoSQL Types
− Key-Value Store

simplest form of a NoSQL database (no big value for data science)
− Document databases (JSON style)

open schema

can handle complex data structures as arrays and list
− Wide column databases, most like relational DBs:

columns are not fixed, data is de-normalised, 

can handle complex data structures as arrays and lists
− Graph

network of connected entities linked by edges with properties, query on properties and links
NoSQL Databases Benefits
− No need to normalise data (schema-later)
− Maintain complex data structures
− Supports data sharding
− New ways to query
− Collections can be copied
NoSQL Databases Downsides
− Eventual Consistency (is this a real problem for data science at all?)
− Flexibility requires more responsibility (schema, attribute typos)
− Complexity
A Quick Bird’s Eye View
There are hundreds of databases around nowadays.
Let’s focus on the top database systems.
https://db-engines.com
NoSQL RDBMS
~10 years 40 years +
Points according to https://db-engines.com/de/ranking
Consistency Models of Databases
− A tomicity
− C onsistency
− I solation
− D urability
− B
− A asic Availability
− S oft-State
− E ventual consistency
Open Source Check
− Security
− Transparency
− Engaging Collaboration
− Quality
− Auditability
− Try Before You Buy (EE)
− Rule of Thumb: 

Open Software is way more
affordable than closed
* via vendors
Open Source Enterprise Editions
Oracle x +
MySQL +-?!? +
MicroSoft SQL Server +
PostgreSQL + +*
DB2 +
MongoDB + +
Redis + +*
Cassandra + +*
HBase + +*
Amazon DynamoDB DAAS
Neo4J + +
The Contenders
Type Chosen
PostgreSQL RDBMS Top OS RDBMS
MongoDB Document-store Top NoSQL (DS)
Cassandra Wide-column store Top NoSQL (WCS)
Neo4J Graph Top NoSQL (Graph)
Relational Database Management System
Dat
aba
se
Cassandra
Graph Database
Gra
ph
https://neo4j.com/developer/guide-importing-data-and-etl/
How Hard is it to Collect Data?
Data Collection, cleaning and
restructuring
Multiple data sources Data retention
PostgreSQL Depends on schema complexity Depends on schema complexity easy
MongoDB easy easy medium
Cassandra easy easy hard
Neo4J (easy) N/A easy
What about Data Types
enforced flexible enforceable
PostgreSQL yes (NoSQL feature) predefined
MongoDB possible yes yes
Cassandra yes untyped collection columns predefined
Neo4J (yes) N/A N/A
How Hard is it to Consolidate Data?
Linking Missing data Dirty data
Persisting cleaned
dataset
PostgreSQL built schema
pre-processing
recommended
pre-processing
recommended
easy
MongoDB easy (within db) flexible post-processing flexible post-processing easy
Cassandra partitioning hard hard easy
Neo4J yes* hard hard easy
How Hard is it to Write Queries Against These Databases?
Language Basic Queries Advanced Queries
PostgreSQL SQL easy hard
MongoDB MQL
query: easy
aggregation: medium
query: medium
aggregation: medium
Cassandra CSQL easy hard
Neo4J Cypher easy hard
How Hard is Querying to Learn?
Language Basic Queries Advanced Queries
PostgreSQL SQL easy hard
MongoDB MQL
query: easy
aggregation: easy
query: medium
aggregation: medium
Cassandra CSQL easy-medium hard
Neo4J Cypher medium hard
SQL Benefits and Downsides
− Common standard
− Long-established
− Mother of many others e.g. CSQL, ABAP,

Pig, SPARQL,…
− Set based logic
− Complexity increases fast
− Badly designed JOINs vs. performance
− Overhead understanding a large schema
− Set based logic
SELECT EmployeeID, FirstName, LastName, HireDate, City

FROM Employees

WHERE HireDate BETWEEN '1-june-1992' AND '15-december-1993'
SQL
SELECT A.SD1, B.ED1 FROM



(SELECT SD1, ROW_NUMBER() OVER (ORDER BY SD1) AS RN1 FROM (SELECT T1.Start_Date AS
SD1, T2.Start_Date AS SD2 FROM (SELECT * FROM Projects ORDER BY Start_Date) T1



LEFT JOIN (SELECT * FROM Projects ORDER BY Start_Date) T2



ON T1.Start_Date=(T2.Start_Date+1)



ORDER BY T1.Start_Date) WHERE SD2 IS NULL) A



INNER JOIN



(SELECT ED1, ROW_NUMBER() OVER (ORDER BY ED1) AS RN2 FROM (SELECT T1.End_Date AS
ED1, T2.Start_Date AS SD2 FROM (SELECT * FROM Projects ORDER BY Start_Date) T1



LEFT JOIN (SELECT * FROM Projects ORDER BY Start_Date) T2



ON T1.End_Date=(T2.Start_Date) ORDER BY T1.Start_Date) WHERE SD2 IS NULL) B



ON A.RN1=B.RN2



ORDER BY (B.ED1-A.SD1), A.SD1;
SELECT * FROM numberOfRequests

WHERE cluster = ‘cluster1’

AND date = ‘2015-06-05’

AND datacenter = 'US_WEST_COAST'

AND (hour, minute) IN ((14, 0), (15, 0));
Cassandra
MATCH (c:Customer {companyName:"Drachenblut Delikatessen"})

OPTIONAL MATCH (p:Product)<-[pu:PRODUCT]-(:Order)<-[:PURCHASED]-(c)

RETURN p.productName, toInt(sum(pu.unitPrice * pu.quantity)) AS volume

ORDER BY volume DESC;
Neo4J
pipeline = [
    {"$match": {"artistName": “Suppenstar"}},
    {"$sort": {„info.releaseDate: 1)])},
    {"$group": {
       "_id": {"$year": "$info.releaseDateEpoch"},
"count": {"$sum": "1}}},
    {"$project": {"year": "$_id.year", "count": 1}}},
]
MongoDB Aggregation Pipeline
−$match
−$sort
−$limit
−$project
−$group
−$unwind
−$lookup
−WHERE | HAVING
−ORDER BY
−LIMIT
−SELECT
−GROUP BY
−(JOIN)
−LEFT OUTER JOIN
Aggregation Pipeline / SQL
How Hard is it to Run?
− Installation
− Maintenance
− Cleaning up
− Compacting
− Backup
− Replica (or continuous)
− File System backup
− Dump
Set-Up Maintenance Backup
PostgreSQL easy medium medium
MongoDB easy low
easy
(Replica / CS)
Cassandra easy intense
easy
(Replica)
Neo4J easy medium easy
How Hard is Run Analytics without Affecting Production Performance?
replica shard
PostgreSQL medium
medium

(if run on overnight backup)
MongoDB easy: hidden replica node
medium: hidden replica node with
shard-key
Cassandra
depends on number of nodes &
partitioning
depends on number of nodes &
partitioning
Neo4J medium medium
Collection
Shard1 Shard2 Shard3 Shard4
server-1 server-2 server-3 server-4
server
CollectionCollection
server server
Replica-Set
horizontal scaling: one primary + copies
Sharding
vertical scaling
split the data across nodes
one server - utilize multiple cpu + IO
"Micro-Sharding"
How Hard is it to Integrate into Existing Systems?
task type
PostgreSQL easy just an additional SQL database
MongoDB easy replica suggests multiple servers
Cassandra medium requires multiple servers
Neo4J easy just an additional database
How Hard is it to Access / Change These Systems (Authorization)?
User Auth Granularity
PostgreSQL Role-Model Field
MongoDB Role-Model Collection level
Cassandra Role-Model Table
Neo4J Fixed Roles Graph
How Hard is it to Add New Data?
Known attributes Unknown (ext.) data
PostgreSQL easy - medium medium - hard PostgreSQL also has a NoSQL feature
MongoDB easy easy
Cassandra easy easy
Neo4J easy N/A data needs to be graph
How Hard is it to Understand the Data Structure?
small system medium system extensive system
PostgreSQL easy medium (partitioned) hard
MongoDB easy easy easy - medium
Cassandra easy hard (highly partitioned) hard (highly partitioned)
Neo4J easy easy easy - medium
RowKey: john
=> (column=, value=, timestamp=1374683971220000)
=> (column=map1:doug, value='555-1579', timestamp=1374683971220000)
=> (column=map1:patricia, value='555-4326', timestamp=1374683971220000)
=> (column=list1:26017c10f48711e2801fdf9895e5d0f8, value='doug', timestamp=1374683971220000)
=> (column=list1:26017c12f48711e2801fdf9895e5d0f8, value='scott', timestamp=1374683971220000)
=> (column=set1:'patricia', value=, timestamp=1374683971220000) => (column=set1:'scott', value=,
timestamp=1374683971220000)
Cassandra
https://neo4j.com/blog/oscon-twitter-graph/
Neo4J
Relational Model Document Model
https://neo4j.com/blog/neo4j-2-0-0-m06-introducing-neo4js-browser/
How Hard is it to Handle Growth?
read capacity
PostgreSQL medium advanced
MongoDB easy medium
Cassandra medium medium
Neo4J medium medium
Some More Use Cases for Databases in Data Science
− Storing model parameters (even models)
− Documenting experiments
− Collecting performance metrics of models
− …
Conclusion
− Analyse all your real needs and focus on those
− Chose an accessible, simple solution (don‘t reach out to high)
− Do not only focus on performance
− Try, play and test before making final decisions
− If you have a very specific use case go for the specialist system
− A good choice for general purpose is MongoDB
− If you work only on graphs use Neo4J
− If you have simple tables and know SQL got for a RDBMS
My Advise If You Are New in the Database Space.
− Document store is easy to understand and maintain
− Less querying overhead for multi-dimensional data
− Aggregation pipeline
− Grouping
− $relational (LO JOIN)
− $graphLookup
− Many built-in operators
− Easy install and replicate
− Compressed storage by default, in-memory avail.
− Learning at least Basic SQL and Set Theory is a MUST
− SQLAlchemy if you work with RDBMS
Thanks for Contributing
Databases for Data Science is still an actively discussed topic in the experts‘ community.
This presentation will be constantly updated.
Newer findings and updates will be added.
Stay informed:
Follow me on Twitter @hendorf or LinkedIn
Or drop me an email ah@koenigsweg.com
−Jens Dittrich, Professor bigdata.uni-saarland.de @jensdittrich
ah@koenigsweg.com
@hendorf
Thank you!
Q & A
Databases for Data Science

More Related Content

What's hot

Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Models
butest
 
CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf
CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdfCS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf
CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf
AALIM MUHAMMED SALEGH COLLEGE OF ENGINEERING
 
Coursera Machine Learning (by Andrew Ng)_강의정리
Coursera Machine Learning (by Andrew Ng)_강의정리Coursera Machine Learning (by Andrew Ng)_강의정리
Coursera Machine Learning (by Andrew Ng)_강의정리
SANG WON PARK
 
Decision Tree in Machine Learning
Decision Tree in Machine Learning  Decision Tree in Machine Learning
Decision Tree in Machine Learning
Souma Maiti
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflow
Charmi Chokshi
 
Unit 2 unsupervised learning.pptx
Unit 2 unsupervised learning.pptxUnit 2 unsupervised learning.pptx
Unit 2 unsupervised learning.pptx
Dr.Shweta
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine Learning
Kuppusamy P
 
What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?
Kazuki Yoshida
 
Elhabian lda09
Elhabian lda09Elhabian lda09
Elhabian lda09
Mr. Neelamegam D
 
Curso CSS 3 - Aula Introdutória com conceitos básicos
Curso CSS 3 - Aula Introdutória com conceitos básicosCurso CSS 3 - Aula Introdutória com conceitos básicos
Curso CSS 3 - Aula Introdutória com conceitos básicos
Tiago Antônio da Silva
 
Estrutura de Dados - Aula 03 - Ponteiros e Funções
Estrutura de Dados - Aula 03 - Ponteiros e FunçõesEstrutura de Dados - Aula 03 - Ponteiros e Funções
Estrutura de Dados - Aula 03 - Ponteiros e Funções
Leinylson Fontinele
 
Pesquisar corretamente na internet
Pesquisar corretamente na internetPesquisar corretamente na internet
Pesquisar corretamente na internet
anapaulaoliveira
 
Svm Presentation
Svm PresentationSvm Presentation
Svm Presentation
shahparin
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
ananth
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Shahar Cohen
 
Prepare your data for machine learning
Prepare your data for machine learningPrepare your data for machine learning
Prepare your data for machine learning
Ivo Andreev
 
K means
K meansK means
Software Requirements Specification Template
Software Requirements Specification TemplateSoftware Requirements Specification Template
Software Requirements Specification Template
Digitalya OPS
 
Implementation of Software Testing
Implementation of Software TestingImplementation of Software Testing
Implementation of Software Testing
Mahesh Kodituwakku
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation Systems
Robin Reni
 

What's hot (20)

Machine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative ModelsMachine Learning: Generative and Discriminative Models
Machine Learning: Generative and Discriminative Models
 
CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf
CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdfCS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf
CS8080_IRT_UNIT - III T3 NAIVE TEXT CLASSIFICATION.pdf
 
Coursera Machine Learning (by Andrew Ng)_강의정리
Coursera Machine Learning (by Andrew Ng)_강의정리Coursera Machine Learning (by Andrew Ng)_강의정리
Coursera Machine Learning (by Andrew Ng)_강의정리
 
Decision Tree in Machine Learning
Decision Tree in Machine Learning  Decision Tree in Machine Learning
Decision Tree in Machine Learning
 
Deep learning with tensorflow
Deep learning with tensorflowDeep learning with tensorflow
Deep learning with tensorflow
 
Unit 2 unsupervised learning.pptx
Unit 2 unsupervised learning.pptxUnit 2 unsupervised learning.pptx
Unit 2 unsupervised learning.pptx
 
Logistic regression in Machine Learning
Logistic regression in Machine LearningLogistic regression in Machine Learning
Logistic regression in Machine Learning
 
What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?What is the Expectation Maximization (EM) Algorithm?
What is the Expectation Maximization (EM) Algorithm?
 
Elhabian lda09
Elhabian lda09Elhabian lda09
Elhabian lda09
 
Curso CSS 3 - Aula Introdutória com conceitos básicos
Curso CSS 3 - Aula Introdutória com conceitos básicosCurso CSS 3 - Aula Introdutória com conceitos básicos
Curso CSS 3 - Aula Introdutória com conceitos básicos
 
Estrutura de Dados - Aula 03 - Ponteiros e Funções
Estrutura de Dados - Aula 03 - Ponteiros e FunçõesEstrutura de Dados - Aula 03 - Ponteiros e Funções
Estrutura de Dados - Aula 03 - Ponteiros e Funções
 
Pesquisar corretamente na internet
Pesquisar corretamente na internetPesquisar corretamente na internet
Pesquisar corretamente na internet
 
Svm Presentation
Svm PresentationSvm Presentation
Svm Presentation
 
An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier An Overview of Naïve Bayes Classifier
An Overview of Naïve Bayes Classifier
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Prepare your data for machine learning
Prepare your data for machine learningPrepare your data for machine learning
Prepare your data for machine learning
 
K means
K meansK means
K means
 
Software Requirements Specification Template
Software Requirements Specification TemplateSoftware Requirements Specification Template
Software Requirements Specification Template
 
Implementation of Software Testing
Implementation of Software TestingImplementation of Software Testing
Implementation of Software Testing
 
Recommendation Systems
Recommendation SystemsRecommendation Systems
Recommendation Systems
 

Similar to Databases for Data Science

Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Databricks
 
NoSQL_Night
NoSQL_NightNoSQL_Night
NoSQL_Night
Clarence J M Tauro
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera, Inc.
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
Marcin Bielak
 
Introduction to asdfghjkln b vfgh n v
Introduction to asdfghjkln b vfgh n    vIntroduction to asdfghjkln b vfgh n    v
Introduction to asdfghjkln b vfgh n v
23mz02
 
managing big data
managing big datamanaging big data
managing big data
Suveeksha
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
Milos Milovanovic
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Institute of Contemporary Sciences
 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
Facultad de Informática UCM
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
samthemonad
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Databricks
 
PostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLPostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQL
Alexei Krasner
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101
MongoDB
 
Fontys Lecture - The Evolution of the Oracle Database 2016
Fontys Lecture -  The Evolution of the Oracle Database 2016Fontys Lecture -  The Evolution of the Oracle Database 2016
Fontys Lecture - The Evolution of the Oracle Database 2016
Lucas Jellema
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
Andrew Lamb
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
James Serra
 
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
Felix Gessert
 
Evolution of the DBA to Data Platform Administrator/Specialist
Evolution of the DBA to Data Platform Administrator/SpecialistEvolution of the DBA to Data Platform Administrator/Specialist
Evolution of the DBA to Data Platform Administrator/Specialist
Tony Rogerson
 
CIKB - Software Architecture Analysis Design
CIKB - Software Architecture Analysis DesignCIKB - Software Architecture Analysis Design
CIKB - Software Architecture Analysis Design
Antonio Castellon
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
Open Analytics
 

Similar to Databases for Data Science (20)

Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
NoSQL_Night
NoSQL_NightNoSQL_Night
NoSQL_Night
 
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your DataCloudera Breakfast Series, Analytics Part 1: Use All Your Data
Cloudera Breakfast Series, Analytics Part 1: Use All Your Data
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Introduction to asdfghjkln b vfgh n v
Introduction to asdfghjkln b vfgh n    vIntroduction to asdfghjkln b vfgh n    v
Introduction to asdfghjkln b vfgh n v
 
managing big data
managing big datamanaging big data
managing big data
 
Planing and optimizing data lake architecture
Planing and optimizing data lake architecturePlaning and optimizing data lake architecture
Planing and optimizing data lake architecture
 
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 Planning and Optimizing Data Lake Architecture - Milos Milovanovic Planning and Optimizing Data Lake Architecture - Milos Milovanovic
Planning and Optimizing Data Lake Architecture - Milos Milovanovic
 
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...Paradigmas de procesamiento en  Big Data: estado actual,  tendencias y oportu...
Paradigmas de procesamiento en Big Data: estado actual, tendencias y oportu...
 
Agile data lake? An oxymoron?
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
 
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
Extending Apache Spark SQL Data Source APIs with Join Push Down with Ioana De...
 
PostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQLPostgreSQL as an Alternative to MSSQL
PostgreSQL as an Alternative to MSSQL
 
Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101Ops Jumpstart: MongoDB Administration 101
Ops Jumpstart: MongoDB Administration 101
 
Fontys Lecture - The Evolution of the Oracle Database 2016
Fontys Lecture -  The Evolution of the Oracle Database 2016Fontys Lecture -  The Evolution of the Oracle Database 2016
Fontys Lecture - The Evolution of the Oracle Database 2016
 
2021 04-20 apache arrow and its impact on the database industry.pptx
2021 04-20  apache arrow and its impact on the database industry.pptx2021 04-20  apache arrow and its impact on the database industry.pptx
2021 04-20 apache arrow and its impact on the database industry.pptx
 
Microsoft Data Platform - What's included
Microsoft Data Platform - What's includedMicrosoft Data Platform - What's included
Microsoft Data Platform - What's included
 
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
NoSQL Data Stores in Research and Practice - ICDE 2016 Tutorial - Extended Ve...
 
Evolution of the DBA to Data Platform Administrator/Specialist
Evolution of the DBA to Data Platform Administrator/SpecialistEvolution of the DBA to Data Platform Administrator/Specialist
Evolution of the DBA to Data Platform Administrator/Specialist
 
CIKB - Software Architecture Analysis Design
CIKB - Software Architecture Analysis DesignCIKB - Software Architecture Analysis Design
CIKB - Software Architecture Analysis Design
 
No sql and sql - open analytics summit
No sql and sql - open analytics summitNo sql and sql - open analytics summit
No sql and sql - open analytics summit
 

More from Alexander Hendorf

Deep Learning for Fun and Profit [PyConDE 2018]
Deep Learning for Fun and Profit [PyConDE 2018]Deep Learning for Fun and Profit [PyConDE 2018]
Deep Learning for Fun and Profit [PyConDE 2018]
Alexander Hendorf
 
Agile Datenanalsyse - der schnelle Weg zum Mehrwert
Agile Datenanalsyse - der schnelle Weg zum MehrwertAgile Datenanalsyse - der schnelle Weg zum Mehrwert
Agile Datenanalsyse - der schnelle Weg zum Mehrwert
Alexander Hendorf
 
Einführung Datenanalyse mit Pandas [data2day]
Einführung Datenanalyse mit Pandas [data2day]Einführung Datenanalyse mit Pandas [data2day]
Einführung Datenanalyse mit Pandas [data2day]
Alexander Hendorf
 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Alexander Hendorf
 
Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]
Alexander Hendorf
 
Data Mangling with mongoDB the Right Way [PyData London] 2016]
Data Mangling with mongoDB the Right Way [PyData London] 2016]Data Mangling with mongoDB the Right Way [PyData London] 2016]
Data Mangling with mongoDB the Right Way [PyData London] 2016]
Alexander Hendorf
 
Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]
Alexander Hendorf
 
NoSQL oder: Freiheit ist nicht schmerzfrei - IT Tage
NoSQL oder: Freiheit ist nicht schmerzfrei - IT TageNoSQL oder: Freiheit ist nicht schmerzfrei - IT Tage
NoSQL oder: Freiheit ist nicht schmerzfrei - IT Tage
Alexander Hendorf
 
Neat Analytics with Pandas 4 3 [PyParis]
Neat Analytics with Pandas 4 3 [PyParis]Neat Analytics with Pandas 4 3 [PyParis]
Neat Analytics with Pandas 4 3 [PyParis]
Alexander Hendorf
 
Data analysis and visualization with mongo db [mongodb world 2016]
Data analysis and visualization with mongo db [mongodb world 2016]Data analysis and visualization with mongo db [mongodb world 2016]
Data analysis and visualization with mongo db [mongodb world 2016]
Alexander Hendorf
 
Time travel and time series analysis with pandas + statsmodels
Time travel and time series analysis with pandas + statsmodelsTime travel and time series analysis with pandas + statsmodels
Time travel and time series analysis with pandas + statsmodels
Alexander Hendorf
 
Data mangling with mongo db the right way [pyconit 2016]
Data mangling with mongo db the right way [pyconit 2016]Data mangling with mongo db the right way [pyconit 2016]
Data mangling with mongo db the right way [pyconit 2016]
Alexander Hendorf
 

More from Alexander Hendorf (12)

Deep Learning for Fun and Profit [PyConDE 2018]
Deep Learning for Fun and Profit [PyConDE 2018]Deep Learning for Fun and Profit [PyConDE 2018]
Deep Learning for Fun and Profit [PyConDE 2018]
 
Agile Datenanalsyse - der schnelle Weg zum Mehrwert
Agile Datenanalsyse - der schnelle Weg zum MehrwertAgile Datenanalsyse - der schnelle Weg zum Mehrwert
Agile Datenanalsyse - der schnelle Weg zum Mehrwert
 
Einführung Datenanalyse mit Pandas [data2day]
Einführung Datenanalyse mit Pandas [data2day]Einführung Datenanalyse mit Pandas [data2day]
Einführung Datenanalyse mit Pandas [data2day]
 
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
Introduction to Pandas and Time Series Analysis [Budapest BI Forum]
 
Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]
 
Data Mangling with mongoDB the Right Way [PyData London] 2016]
Data Mangling with mongoDB the Right Way [PyData London] 2016]Data Mangling with mongoDB the Right Way [PyData London] 2016]
Data Mangling with mongoDB the Right Way [PyData London] 2016]
 
Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]Introduction to Data Analtics with Pandas [PyCon Cz]
Introduction to Data Analtics with Pandas [PyCon Cz]
 
NoSQL oder: Freiheit ist nicht schmerzfrei - IT Tage
NoSQL oder: Freiheit ist nicht schmerzfrei - IT TageNoSQL oder: Freiheit ist nicht schmerzfrei - IT Tage
NoSQL oder: Freiheit ist nicht schmerzfrei - IT Tage
 
Neat Analytics with Pandas 4 3 [PyParis]
Neat Analytics with Pandas 4 3 [PyParis]Neat Analytics with Pandas 4 3 [PyParis]
Neat Analytics with Pandas 4 3 [PyParis]
 
Data analysis and visualization with mongo db [mongodb world 2016]
Data analysis and visualization with mongo db [mongodb world 2016]Data analysis and visualization with mongo db [mongodb world 2016]
Data analysis and visualization with mongo db [mongodb world 2016]
 
Time travel and time series analysis with pandas + statsmodels
Time travel and time series analysis with pandas + statsmodelsTime travel and time series analysis with pandas + statsmodels
Time travel and time series analysis with pandas + statsmodels
 
Data mangling with mongo db the right way [pyconit 2016]
Data mangling with mongo db the right way [pyconit 2016]Data mangling with mongo db the right way [pyconit 2016]
Data mangling with mongo db the right way [pyconit 2016]
 

Recently uploaded

Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
Vineet
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
dataschool1
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
Vineet
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
Vietnam Cotton & Spinning Association
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
Vineet
 
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
zoykygu
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
ywqeos
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
ytypuem
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Marlon Dumas
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
eudsoh
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
Timothy Spann
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
Alireza Kamrani
 
SAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content DocumentSAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content Document
newdirectionconsulta
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
zsafxbf
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
22ad0301
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
uevausa
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
TeukuEriSyahputra
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
Rebecca Bilbro
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
perranet1
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
actyx
 

Recently uploaded (20)

Sample Devops SRE Product Companies .pdf
Sample Devops SRE  Product Companies .pdfSample Devops SRE  Product Companies .pdf
Sample Devops SRE Product Companies .pdf
 
A gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented GenerationA gentle exploration of Retrieval Augmented Generation
A gentle exploration of Retrieval Augmented Generation
 
Digital Marketing Performance Marketing Sample .pdf
Digital Marketing Performance Marketing  Sample .pdfDigital Marketing Performance Marketing  Sample .pdf
Digital Marketing Performance Marketing Sample .pdf
 
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
[VCOSA] Monthly Report - Cotton & Yarn Statistics May 2024
 
Senior Software Profiles Backend Sample - Sheet1.pdf
Senior Software Profiles  Backend Sample - Sheet1.pdfSenior Software Profiles  Backend Sample - Sheet1.pdf
Senior Software Profiles Backend Sample - Sheet1.pdf
 
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
一比一原版(heriotwatt学位证书)英国赫瑞瓦特大学毕业证如何办理
 
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
一比一原版(lbs毕业证书)伦敦商学院毕业证如何办理
 
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
一比一原版(曼大毕业证书)曼尼托巴大学毕业证如何办理
 
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
Discovering Digital Process Twins for What-if Analysis: a Process Mining Appr...
 
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
一比一原版马来西亚博特拉大学毕业证(upm毕业证)如何办理
 
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
06-20-2024-AI Camp Meetup-Unstructured Data and Vector Databases
 
How To Control IO Usage using Resource Manager
How To Control IO Usage using Resource ManagerHow To Control IO Usage using Resource Manager
How To Control IO Usage using Resource Manager
 
SAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content DocumentSAP BW4HANA Implementagtion Content Document
SAP BW4HANA Implementagtion Content Document
 
一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理一比一原版莱斯大学毕业证(rice毕业证)如何办理
一比一原版莱斯大学毕业证(rice毕业证)如何办理
 
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdfNamma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
Namma-Kalvi-11th-Physics-Study-Material-Unit-1-EM-221086.pdf
 
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
一比一原版加拿大渥太华大学毕业证(uottawa毕业证书)如何办理
 
Template xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptxTemplate xxxxxxxx ssssssssssss Sertifikat.pptx
Template xxxxxxxx ssssssssssss Sertifikat.pptx
 
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
PyData London 2024: Mistakes were made (Dr. Rebecca Bilbro)
 
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdfreading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
reading_sample_sap_press_operational_data_provisioning_with_sap_bw4hana (1).pdf
 
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
一比一原版斯威本理工大学毕业证(swinburne毕业证)如何办理
 

Databases for Data Science

  • 2. Alexander C. S. Hendorf − Senior Consultant Information Technology − Program Chair EuroPython, PyConDE & PyData Karlsruhe, 
 EuroSciPy, PSF Managing Member − PyData Frankfurt and PyData Südwest Organiser − Program Committee Percona Live − MongoDB Masters / MongoDB Certified DBA − Speaker Europe & USA MongoDB World New York / San José, PyCon Italy, CEBIT, BI Forum, IT-Tage FFM, PyData, PyParis, PyCon UK, Budapest BI,…. ah@koenigsweg.com @hendorf
  • 3. Services Data We guide our clients through development and implementation processes of technologies and applications to analyse, evaluate and visualize business data. Strategy & Operations We advise SME, start-ups and public institutions on building efficient sales structures, on process optimization and on business development. Communication We provide sound communication strategies and creative campaigns to communicate your content and messages authentically throughout all channels. Financial Service Technologies Our industry experts support clients in the financial sector in developing powerful and compliant FinTech applications.
  • 4. ! Let‘s get to know each other! » What‘s your background?« − Data scientist − Database admin − Curious Pythonista − Consultant, decision maker (IT Executive, Consultant, Innovation Manager, YouTube influencer) 
  • 5. " Let‘s get to know each other! » What‘s your Experience in…« − RDBSs − none − some − a lot − Hadoop et al. − none − some − a lot − NoSQL Systems 
 without Hadoop et al. − none − some − a lot
  • 6. # Let‘s get to know each other! » What‘s are you looking for?« − Integration for data science into existing ecosystems − Learning about databases for data science projects − General interest / curiosity − Just killing time until PyFiorentina − …?
  • 7. Three Angles for Databases for Data Science − Choosing a database for data science projects − Evaluating an existing database for data science requirements − How to integrate into an existing ecosystem
  • 8.
  • 9. $ „How deeply do data science & data base understand each other?.“
  • 10. $ Ask Google: Data Scientist? Database Admin? %#&' ()*+ ,
  • 11.
  • 12.
  • 14. What are the Benefits of a Database for Data Science Anyway? − Common source of data − Avoiding redundancy (e.g. files) − Persistence − Optimized for handling and accessing data for decades − Scalability − Staying very close to the data
  • 15. A Quick Recap on Database History − 1960s, navigational DBMS (disks & drums) − 1970s, relational DBMS − 1980s, on the desktop − 1990s, object-oriented − 2000s, NoSQL
  • 16. Relational Databases − Records are organised into tables − Rows of these table are identified by unique keys − Data spans multiple tables, linked via ids − Data is ideally normalised − Data can be denormalized for performance − Transactions are ACID [Atomic, Consistent , Isolated, Durable]
  • 17.
  • 19. Relational Databases Benefits − Widely used and supported − Normalized data − Comprehensive querying via SQL language 
 - though some differences between databases − Well researched and optimized over decades
  • 20.
  • 21.
  • 22.
  • 23. Relational Databases Downsides − Schemas are fixed and have to be pre-defined upfront (schema-first) − Altering schema is not trivial − Joining tables, depending on complexity, data volume may be costly,
 also consider overhead understanding a schema with many tables − Difficult to scale out − Few data structures (tables, rows)
  • 24. NoSQL Types − Key-Value Store
 simplest form of a NoSQL database (no big value for data science) − Document databases (JSON style)
 open schema
 can handle complex data structures as arrays and list − Wide column databases, most like relational DBs:
 columns are not fixed, data is de-normalised, 
 can handle complex data structures as arrays and lists − Graph
 network of connected entities linked by edges with properties, query on properties and links
  • 25. NoSQL Databases Benefits − No need to normalise data (schema-later) − Maintain complex data structures − Supports data sharding − New ways to query − Collections can be copied
  • 26. NoSQL Databases Downsides − Eventual Consistency (is this a real problem for data science at all?) − Flexibility requires more responsibility (schema, attribute typos) − Complexity
  • 27. A Quick Bird’s Eye View There are hundreds of databases around nowadays. Let’s focus on the top database systems.
  • 29. NoSQL RDBMS ~10 years 40 years + Points according to https://db-engines.com/de/ranking
  • 30. Consistency Models of Databases − A tomicity − C onsistency − I solation − D urability − B − A asic Availability − S oft-State − E ventual consistency
  • 31. Open Source Check − Security − Transparency − Engaging Collaboration − Quality − Auditability − Try Before You Buy (EE) − Rule of Thumb: 
 Open Software is way more affordable than closed * via vendors Open Source Enterprise Editions Oracle x + MySQL +-?!? + MicroSoft SQL Server + PostgreSQL + +* DB2 + MongoDB + + Redis + +* Cassandra + +* HBase + +* Amazon DynamoDB DAAS Neo4J + +
  • 32. The Contenders Type Chosen PostgreSQL RDBMS Top OS RDBMS MongoDB Document-store Top NoSQL (DS) Cassandra Wide-column store Top NoSQL (WCS) Neo4J Graph Top NoSQL (Graph)
  • 33. Relational Database Management System Dat aba se
  • 34.
  • 37.
  • 38. How Hard is it to Collect Data? Data Collection, cleaning and restructuring Multiple data sources Data retention PostgreSQL Depends on schema complexity Depends on schema complexity easy MongoDB easy easy medium Cassandra easy easy hard Neo4J (easy) N/A easy
  • 39. What about Data Types enforced flexible enforceable PostgreSQL yes (NoSQL feature) predefined MongoDB possible yes yes Cassandra yes untyped collection columns predefined Neo4J (yes) N/A N/A
  • 40. How Hard is it to Consolidate Data? Linking Missing data Dirty data Persisting cleaned dataset PostgreSQL built schema pre-processing recommended pre-processing recommended easy MongoDB easy (within db) flexible post-processing flexible post-processing easy Cassandra partitioning hard hard easy Neo4J yes* hard hard easy
  • 41. How Hard is it to Write Queries Against These Databases? Language Basic Queries Advanced Queries PostgreSQL SQL easy hard MongoDB MQL query: easy aggregation: medium query: medium aggregation: medium Cassandra CSQL easy hard Neo4J Cypher easy hard
  • 42. How Hard is Querying to Learn? Language Basic Queries Advanced Queries PostgreSQL SQL easy hard MongoDB MQL query: easy aggregation: easy query: medium aggregation: medium Cassandra CSQL easy-medium hard Neo4J Cypher medium hard
  • 43. SQL Benefits and Downsides − Common standard − Long-established − Mother of many others e.g. CSQL, ABAP,
 Pig, SPARQL,… − Set based logic − Complexity increases fast − Badly designed JOINs vs. performance − Overhead understanding a large schema − Set based logic
  • 44. SELECT EmployeeID, FirstName, LastName, HireDate, City
 FROM Employees
 WHERE HireDate BETWEEN '1-june-1992' AND '15-december-1993' SQL
  • 45. SELECT A.SD1, B.ED1 FROM
 
 (SELECT SD1, ROW_NUMBER() OVER (ORDER BY SD1) AS RN1 FROM (SELECT T1.Start_Date AS SD1, T2.Start_Date AS SD2 FROM (SELECT * FROM Projects ORDER BY Start_Date) T1
 
 LEFT JOIN (SELECT * FROM Projects ORDER BY Start_Date) T2
 
 ON T1.Start_Date=(T2.Start_Date+1)
 
 ORDER BY T1.Start_Date) WHERE SD2 IS NULL) A
 
 INNER JOIN
 
 (SELECT ED1, ROW_NUMBER() OVER (ORDER BY ED1) AS RN2 FROM (SELECT T1.End_Date AS ED1, T2.Start_Date AS SD2 FROM (SELECT * FROM Projects ORDER BY Start_Date) T1
 
 LEFT JOIN (SELECT * FROM Projects ORDER BY Start_Date) T2
 
 ON T1.End_Date=(T2.Start_Date) ORDER BY T1.Start_Date) WHERE SD2 IS NULL) B
 
 ON A.RN1=B.RN2
 
 ORDER BY (B.ED1-A.SD1), A.SD1;
  • 46. SELECT * FROM numberOfRequests
 WHERE cluster = ‘cluster1’
 AND date = ‘2015-06-05’
 AND datacenter = 'US_WEST_COAST'
 AND (hour, minute) IN ((14, 0), (15, 0)); Cassandra
  • 47. MATCH (c:Customer {companyName:"Drachenblut Delikatessen"})
 OPTIONAL MATCH (p:Product)<-[pu:PRODUCT]-(:Order)<-[:PURCHASED]-(c)
 RETURN p.productName, toInt(sum(pu.unitPrice * pu.quantity)) AS volume
 ORDER BY volume DESC; Neo4J
  • 48. pipeline = [     {"$match": {"artistName": “Suppenstar"}},     {"$sort": {„info.releaseDate: 1)])},     {"$group": {        "_id": {"$year": "$info.releaseDateEpoch"}, "count": {"$sum": "1}}},     {"$project": {"year": "$_id.year", "count": 1}}}, ] MongoDB Aggregation Pipeline
  • 49.
  • 50. −$match −$sort −$limit −$project −$group −$unwind −$lookup −WHERE | HAVING −ORDER BY −LIMIT −SELECT −GROUP BY −(JOIN) −LEFT OUTER JOIN Aggregation Pipeline / SQL
  • 51. How Hard is it to Run? − Installation − Maintenance − Cleaning up − Compacting − Backup − Replica (or continuous) − File System backup − Dump Set-Up Maintenance Backup PostgreSQL easy medium medium MongoDB easy low easy (Replica / CS) Cassandra easy intense easy (Replica) Neo4J easy medium easy
  • 52. How Hard is Run Analytics without Affecting Production Performance? replica shard PostgreSQL medium medium
 (if run on overnight backup) MongoDB easy: hidden replica node medium: hidden replica node with shard-key Cassandra depends on number of nodes & partitioning depends on number of nodes & partitioning Neo4J medium medium
  • 53. Collection Shard1 Shard2 Shard3 Shard4 server-1 server-2 server-3 server-4 server CollectionCollection server server Replica-Set horizontal scaling: one primary + copies Sharding vertical scaling split the data across nodes one server - utilize multiple cpu + IO "Micro-Sharding"
  • 54. How Hard is it to Integrate into Existing Systems? task type PostgreSQL easy just an additional SQL database MongoDB easy replica suggests multiple servers Cassandra medium requires multiple servers Neo4J easy just an additional database
  • 55. How Hard is it to Access / Change These Systems (Authorization)? User Auth Granularity PostgreSQL Role-Model Field MongoDB Role-Model Collection level Cassandra Role-Model Table Neo4J Fixed Roles Graph
  • 56. How Hard is it to Add New Data? Known attributes Unknown (ext.) data PostgreSQL easy - medium medium - hard PostgreSQL also has a NoSQL feature MongoDB easy easy Cassandra easy easy Neo4J easy N/A data needs to be graph
  • 57. How Hard is it to Understand the Data Structure? small system medium system extensive system PostgreSQL easy medium (partitioned) hard MongoDB easy easy easy - medium Cassandra easy hard (highly partitioned) hard (highly partitioned) Neo4J easy easy easy - medium
  • 58. RowKey: john => (column=, value=, timestamp=1374683971220000) => (column=map1:doug, value='555-1579', timestamp=1374683971220000) => (column=map1:patricia, value='555-4326', timestamp=1374683971220000) => (column=list1:26017c10f48711e2801fdf9895e5d0f8, value='doug', timestamp=1374683971220000) => (column=list1:26017c12f48711e2801fdf9895e5d0f8, value='scott', timestamp=1374683971220000) => (column=set1:'patricia', value=, timestamp=1374683971220000) => (column=set1:'scott', value=, timestamp=1374683971220000) Cassandra
  • 61.
  • 63. How Hard is it to Handle Growth? read capacity PostgreSQL medium advanced MongoDB easy medium Cassandra medium medium Neo4J medium medium
  • 64. Some More Use Cases for Databases in Data Science − Storing model parameters (even models) − Documenting experiments − Collecting performance metrics of models − …
  • 65. Conclusion − Analyse all your real needs and focus on those − Chose an accessible, simple solution (don‘t reach out to high) − Do not only focus on performance − Try, play and test before making final decisions − If you have a very specific use case go for the specialist system − A good choice for general purpose is MongoDB − If you work only on graphs use Neo4J − If you have simple tables and know SQL got for a RDBMS
  • 66. My Advise If You Are New in the Database Space. − Document store is easy to understand and maintain − Less querying overhead for multi-dimensional data − Aggregation pipeline − Grouping − $relational (LO JOIN) − $graphLookup − Many built-in operators − Easy install and replicate − Compressed storage by default, in-memory avail. − Learning at least Basic SQL and Set Theory is a MUST − SQLAlchemy if you work with RDBMS
  • 67. Thanks for Contributing Databases for Data Science is still an actively discussed topic in the experts‘ community. This presentation will be constantly updated. Newer findings and updates will be added. Stay informed: Follow me on Twitter @hendorf or LinkedIn Or drop me an email ah@koenigsweg.com −Jens Dittrich, Professor bigdata.uni-saarland.de @jensdittrich