Data Bases - Introduction to Data Science (Frank Kienle)
Lecture: Introduction to Data Science
given 2017 at Technical University of Kaiserslautern, Germany
Lecturer: Frank Kienle, Head of AI and Data Science, Camelot ITLab
Topic: introduction to data bases
Recently, in the fields of Business Intelligence and Data Management, everybody has been talking about data science, machine learning, predictive analytics and many other “clever” terms that promise to turn your data into gold. In these slides, we present the big picture of data science and machine learning. First, we define the context for data mining from a BI perspective and try to clarify various buzzwords in this field. Then we give an overview of the machine learning paradigms. After that, we discuss, at a high level, the various data mining tasks, techniques and applications. Next, we take a quick tour through the Knowledge Discovery Process. Screenshots from demos are shown, and finally we conclude with some takeaway points.
2. 1. Understand the business
2. Understand data
3. Prepare data
4. Model
5. Evaluation
6. Deployment
CRISP Value Process
Frank Kienle
3. Data are individual units of information.
We store more and more data, which leads to Big Data.
Data to Big Data
Frank Kienle
4. Erik Larson, Harper’s Magazine:
‘The keepers of big data say they do it for the consumer’s benefit. But data have a way of being used for purposes other than originally intended.’
(Reality today: private data is becoming commoditized)
Big Data definitions 1989
Frank Kienle
5. Doug Laney, Gartner, 2001:
‘3-D Data Management: Controlling Data Volume, Velocity and Variety’
Big data definition 2001
Source: avyamuthanna.files.wordpress.com/2013/01/picture-11.png
Frank Kienle
6. Big Data is any data that is expensive to manage and hard to extract value from.
(Source: Michael Franklin, Director of the Algorithms, Machines and People Lab (AMPLab), University of California, Berkeley)
Extracting value out of big data is all about predicting the future based on observation of the past.
Big Data today: it’s all about value
Frank Kienle
7. Big Data: the four V’s https://www.ibmbigdatahub.com/infographic/four-vs-big-data
Frank Kienle
9. § up to 75 control devices in each BMW
§ ~1,000 individual configurations possible
§ ~1 GByte functional software, 15 GByte data in the car
§ ~2,000 customer functions implemented
§ ~12,000 error storage memories onboard
§ daily up to 60,000 diagnosis processes worldwide
§ centralized data storage and organization
§ data fusion and data mining for quality assurance and a better understanding of realistic environments
Source: Bitkom BMW keynote talk
source: pixabay
Frank Kienle
10. Tracking the data in a car can have benefits, but comes with security / privacy challenges.
See lecture on ethical challenges.
Big Data Sources: Car black boxes
source: Los Angeles Times
Frank Kienle
11. A gas turbine has up to 1,000 sensors
§ Each sensor can (theoretically) produce data in the millisecond range
§ Example real-life setup:
§ averages are stored per second (history kept for one year)
§ often a long history is available, e.g. back to the year 2000 at 5-minute resolution (averages)
IoT Sensor Data example: Gas Turbine
source: pixabay
Frank Kienle
12. Realistic scenario: store tuples of (timestamp, value)
• new sensors will be introduced, sensors might change
Theoretical data stream storage, gas turbine example:
§ (timestamp, value) = 64 Byte x 1000 sensors →
• 64 kByte in 20 ms
• 3.2 MByte in 1 s
• 276 MByte in 1 day
• 100.9 GByte in 1 year
• x 100 engines in one data center → 10 TByte in 1 year
Reality:
► 1 year stored as 1 s averages: 200 GByte in 1 year
► 10 years stored as 5 min averages: ~7 TByte in 10 years
Frank Kienle
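The same back-of-the-envelope sizing can be scripted. Below is a minimal Python sketch; the tuple size, sensor count and sampling intervals are the slide's round numbers, but the exact daily and yearly figures depend on the assumed encoding, averaging and compression, so the output will not match the slide line by line.

# Back-of-the-envelope sizing for sensor tuple storage.
# Illustrative sketch only: the parameters are assumptions, not the exact
# encoding or retention scheme behind the figures on the slide.

def storage_bytes(bytes_per_tuple: int, n_sensors: int,
                  sample_interval_s: float, duration_s: float) -> float:
    """Total raw storage for one (timestamp, value) tuple per sensor per sample."""
    samples = duration_s / sample_interval_s
    return bytes_per_tuple * n_sensors * samples

YEAR_S = 365 * 24 * 3600

# Raw stream: 64-byte tuples from 1000 sensors every 20 ms
raw_per_second = storage_bytes(64, 1000, 0.02, 1)
raw_per_year = storage_bytes(64, 1000, 0.02, YEAR_S)

# Stored reality: one average per sensor per second, kept for a year
avg_per_year = storage_bytes(64, 1000, 1.0, YEAR_S)

for label, size in [("raw stream / second", raw_per_second),
                    ("raw stream / year", raw_per_year),
                    ("1 s averages / year", avg_per_year)]:
    print(f"{label:22s} {size / 1e9:12.2f} GB")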
13. Big Data Landscape - Data Lake Architecture
Components overview and terminology
14. The data lake is one part in the overall data to value path
Source → Manage → Value
Source: Raw (big) data is typically coming from different sources and has many different data types (mostly structured, semi-structured, unstructured), e.g. twitter, www, social, sensors, mobile payments, transactions, transport, video, pictures, voice.
Manage: A data lake is a storage repository that holds a vast amount of (big) data in its native format and provides intelligent (semi-structured) access until it is needed.
Value: The value of data is delivered via enterprise systems / UX components with the overall goal to perform data driven decisions.
Frank Kienle
15. Stages of Data in Data Lake – High Level Architecture
The data flow and the technology, tools and programming used depend on the data type and the final application layer.
Simplified data lake data path: Data Sources (Business Systems) → Ingestion → Raw Data → Transform / Curate → Enriched Data → Delivery → Applications & Visualizations (what / how per stage):
Ingestion: file transfer, RDB import, REST APIs; stream or batch transfer
Raw Data: initial raw data storage; distributed storage (e.g. Hadoop)
Transform / Curate: cleansing / transform for purpose; distributed storage (e.g. Hadoop)
Enriched Data: add semantic, searchable, anonymized, …; databases for purpose
Delivery: semantic data access; on-request data services
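As an illustration of the ingestion and transform / curate stages, here is a minimal Python sketch. It uses local folders in place of distributed storage (e.g. Hadoop); the paths, column names and cleansing rule are assumptions for the example, and it assumes pandas with a parquet engine (pyarrow) installed.

# Minimal sketch of the raw -> curated path, with local folders standing in
# for the distributed raw and curated zones of a data lake.
import json
import pathlib
import pandas as pd

RAW_ZONE = pathlib.Path("lake/raw")
CURATED_ZONE = pathlib.Path("lake/curated")
RAW_ZONE.mkdir(parents=True, exist_ok=True)
CURATED_ZONE.mkdir(parents=True, exist_ok=True)

def ingest(records: list[dict], name: str) -> pathlib.Path:
    """Ingestion: land the payload unchanged in the raw zone (native format)."""
    path = RAW_ZONE / f"{name}.json"
    path.write_text(json.dumps(records))
    return path

def curate(raw_path: pathlib.Path, name: str) -> pathlib.Path:
    """Transform / curate: cleanse and store a structured copy for a purpose."""
    df = pd.read_json(raw_path)
    df = df.dropna(subset=["sensor_id", "value"])   # basic cleansing
    df["value"] = df["value"].astype(float)
    out = CURATED_ZONE / f"{name}.parquet"
    df.to_parquet(out, index=False)                 # columnar, query-friendly
    return out

raw = ingest([{"sensor_id": "t1", "value": 42.0},
              {"sensor_id": None, "value": 7.0}], "turbine_batch_001")
print("curated file:", curate(raw, "turbine_batch_001"))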
16. Exemplary high-level walk through to extract, store and deliver trend information
Data Source: WWW sources; large-scale web crawlers download all links found and save the webpages ((semi-)unstructured or raw data).
Data Lake (storing and mining relevant information): search and mine the saved data to extract semantics / relevance (mining big data, information retrieval); store the results in a structured (graph) database of trends to allow for easy access (clean, structured data); outcome: relevant internet webpages for the topic.
Final Presentation: trend report, drill-down boards
source: trends.google
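A toy version of the crawl, extract and relevance-scoring step might look as follows; the seed URL and keyword list are placeholders, and a real setup would crawl at scale, respect robots.txt and persist its results into the data lake.

# Toy crawl -> extract -> rank step; not a production crawler.
import requests
from bs4 import BeautifulSoup

SEED_URL = "https://example.com"          # placeholder seed page
KEYWORDS = ["data", "machine learning"]   # assumed topic terms

def fetch(url: str) -> BeautifulSoup:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")

def extract_links(page: BeautifulSoup) -> list[str]:
    """Collect outgoing links ("download all links found")."""
    return [a["href"] for a in page.find_all("a", href=True)]

def relevance(page: BeautifulSoup) -> int:
    """Very rough semantic relevance: keyword counts in the page text."""
    text = page.get_text(" ").lower()
    return sum(text.count(k) for k in KEYWORDS)

page = fetch(SEED_URL)
print("links found:", len(extract_links(page)))
print("relevance score:", relevance(page))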
17. A data to value architecture is composed of many building blocks
Building blocks (from raw data input to valuable data output): data sources and data ingestion, data storage, data access / pipelines, value delivery, business application, data governance, functional layer, deployment / physical.
Depending on the data type and the final business application, different elements are utilized.
18. A data lake is often a fundamental part of the data to value stack and focuses on the technical management of big data
Building blocks as before (from raw data input to valuable data output): data sources and data ingestion, data storage, data access / pipelines, value delivery, business application, data governance, functional layer, deployment / physical.
Depending on the data type and the final business application, different elements are utilized.
Data lake architectures (often) focus on the technical layers of this stack: data ingestion, data storage and data access / pipelines.
19. Data Lake high level architecture with different possibilities to store, process, and deliver valuable information
Data Sources: unstructured data (text, emails, documents; video, media; voice, music, sound), semi-structured data (XML, JSON, sensor, IoT), structured data (databases, ERP core)
Data Ingestion: stream, batch, hybrid
Data Storage: relational (row based, column based), non-relational (graph DB, document DB, key-value)
Data Access / Pipelines: stream, batch, interactive
Value Delivery: descriptive, predictive, prescriptive; visualizations, interfaces
Business Application: operational, tactical, strategic
Data Governance: availability, data security, compliance & controls, roles & responsibility, data quality, reporting, …
Functional Layer
Deployment: on premise, cloud, hybrid; application life cycle
Depending on the data type and the final business application, different elements are utilized.
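To make the difference between the batch and stream access patterns named above concrete, here is a small, self-contained Python sketch; it uses plain lists and generators in place of a real engine (Spark, Flink, Kafka consumers, ...), and the sensor readings are made up.

# Batch vs. stream access on the same (sensor, value) events.
from collections import defaultdict
from typing import Iterable, Iterator

readings = [("t1", 410.0), ("t2", 395.5), ("t1", 412.3), ("t2", 401.1)]

def batch_average(data: list[tuple[str, float]]) -> dict[str, float]:
    """Batch: the full data set is available, compute averages in one pass."""
    sums, counts = defaultdict(float), defaultdict(int)
    for sensor, value in data:
        sums[sensor] += value
        counts[sensor] += 1
    return {s: sums[s] / counts[s] for s in sums}

def streaming_average(stream: Iterable[tuple[str, float]]) -> Iterator[dict[str, float]]:
    """Stream: update running averages as each event arrives."""
    sums, counts = defaultdict(float), defaultdict(int)
    for sensor, value in stream:
        sums[sensor] += value
        counts[sensor] += 1
        yield {s: sums[s] / counts[s] for s in sums}

print("batch:", batch_average(readings))
for state in streaming_average(readings):
    print("stream update:", state)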
20. The design of a data pipeline / data lake depends on the business, technical and non-functional requirements
Business Requirements: Why do we need this?
Technical Requirements: How to realize it?
Non-functional requirements: What constraints?
Data Requirements: Which data are needed?
These requirements drive the choices in every layer of the stack: data storage (relational: row based, column based; non-relational: graph DB, document DB, key-value), data access / pipelines (stream, batch, interactive), value delivery (descriptive, predictive, prescriptive; visualizations, interfaces), business application (operational, tactical, strategic), data governance (availability, data security, roles & responsibility, data quality, reporting) and deployment (on premise, cloud, hybrid; application life cycle).
21. The design of a data pipeline / data lake depends on the business, technical and non-functional requirements – example questions to be answered
Business Requirements (Why do we need this?):
Who is the customer (internal, external)?
How does it help in which situation / process?
Which value do we expect?
If we improve quality by x%, which benefit do we expect?
How to visualize / serve the results / integrate them back?
Non-functional requirements (What constraints?):
Which service level does the solution have (on request, 99% uptime)?
Where is the data allowed to be stored, e.g. GDPR?
Who has access to the application / data?
How is the support organized?
Which security level is granted?
Technical Requirements (How to realize it?):
How does the application provide the result, e.g. which technical interface?
How is the data stored, what are the latency requirements for read / write?
How to ensure a test / productive setup?
Where do we compute and which libraries do we use?
Which algorithms serve the requirements best?
22. For each layer in the data stack many different vendors and applications exist
Layers: Data Storage, Data Access / Pipelines, Value Delivery, Business Application, Functional Layer, Deployment / Physical
Focus: managing big data and data pipelines
• Infrastructure and hardware for Big Data
• Big Data distributions (e.g. Hadoop)
• Components for data management (distributed data systems, in-memory data bases, …)
Focus: extracting value
• Full business SaaS services
• Toolboxes for visualization
• Workflow enablement
23. Nearly all technical Big Data / Data Lakes are based on the (open source) Hadoop & Ecosystem.
Component overview:
HDFS: The Hadoop Distributed File System.
Mahout: Machine learning on the HDFS system.
Zookeeper: A centralized service for maintaining synchronization and group services.
Yarn: Hadoop’s resource manager and job scheduler.
HBase: The Hadoop database.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark SQL: A module for structured and semi-structured data processing.
Hive: A data warehouse infrastructure supporting data summarization, query, and analysis.
Sqoop: A tool to move data from RDBMS to Hadoop.
Flume: A service for moving log data into Hadoop.
Stack layers (ingestion → storage → access / pipelines → functions): Flume ingests unstructured or semi-structured data and Sqoop ingests structured data into HDFS; HBase and the MapReduce framework sit on top of HDFS; Apache Oozie handles workflows; Hive (DW system), Pig Latin (data analysis) and Mahout (machine learning) provide the functional layer; Zookeeper coordinates across all components.
Frank Kienle
24. Nearly all Big Data / Data Lakes are based on the (open source) Hadoop & Ecosystem. However, only Enterprise Big Data Platforms ensure a professional management.
Component overview:
Ambari: An open operational framework for provisioning, managing and monitoring Apache Hadoop clusters.
HDFS: The Hadoop Distributed File System.
Zookeeper: A centralized service for maintaining configuration information and naming, and for providing distributed synchronization and group services.
Yarn: Hadoop’s resource manager and job scheduler.
HBase: The Hadoop database.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark SQL: A module for structured and semi-structured data processing.
Hive: A data warehouse infrastructure supporting data summarization, query, and analysis.
Sqoop: A tool to move data from RDBMS to Hadoop.
Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data into Hadoop.
Kafka: A high-throughput, distributed, publish-subscribe messaging system.
Frank Kienle
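As a small illustration of the Spark SQL role listed above (structured and semi-structured data processing), the following PySpark sketch reads line-delimited JSON and runs a SQL aggregation; the file path and field names are placeholders, and a local Spark installation is assumed.

# Minimal PySpark / Spark SQL sketch; assumes pyspark is installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sensor-demo").getOrCreate()

# Semi-structured input: one JSON object per line, schema inferred by Spark.
readings = spark.read.json("lake/raw/turbine_readings.json")
readings.createOrReplaceTempView("readings")

# Plain SQL over the distributed data set.
avg_per_sensor = spark.sql("""
    SELECT sensor_id, AVG(value) AS avg_value
    FROM readings
    GROUP BY sensor_id
""")
avg_per_sensor.show()

spark.stop()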
25. Visualization tools example for Data Scientists
(some practical tools/libraries; the purpose defines the tool)
General Purpose: Excel. Most often used by nearly everybody for visualization due to its mighty capabilities and penetration.
Rapid Prototyping: Python (Matplotlib), R (Shiny). Open source programming languages, active community participation, quick results and must-know-how for a data scientist.
Web: D3.js + derivates. Focus on interactive data visualizations in web browsers; a JavaScript library for manipulating documents based on data.
Professional Visual Exploration: Tableau, Qlik, MS PowerBI. Professional interactive visualization tools with a focus on quick insights, with the goal to provide business intelligence (BI) for an enterprise.
Focus in stack: Visualization
Frank Kienle
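For the rapid-prototyping row, a minimal Matplotlib example could look like this; the plotted values are synthetic and only illustrate the quick-look workflow.

# Tiny Matplotlib sketch; the sensor values are made up.
import matplotlib.pyplot as plt

hours = list(range(24))
avg_temp = [400 + 5 * ((h % 12) - 6) ** 2 / 6 for h in hours]  # toy curve

plt.figure(figsize=(8, 3))
plt.plot(hours, avg_temp, marker="o", label="sensor t1 (hourly average)")
plt.xlabel("hour of day")
plt.ylabel("temperature [°C]")
plt.title("Quick visual check of one day of turbine sensor data")
plt.legend()
plt.tight_layout()
plt.show()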
26. Libraries/Algorithms/Programming/Tools
(some practical tools/libraries; the purpose defines the selection)
General Purpose: Excel. Worldwide the most used tool for data processing/calculation purposes, with mighty capabilities (mostly not known).
Statistics / Machine Learning: Python + R. The two most important languages for data science (there exist many more).
(Big) Data Processing: Spark + SQL. Query languages and stream/batch processing programming paradigms with easy access to managed big data (there exist many more).
Tool Providers Statistics/ML: SAS, Rapid Miner, Knime, Matlab, … Professional tools with the goal to provide packaged, maintained and easily consumable analytics for professional and citizen data scientists.
Focus in stack: Functional Layer, Data Pipelines
Frank Kienle
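To illustrate the Python statistics / machine learning entry, here is a minimal scikit-learn sketch on synthetic data; in practice the features would come from curated data in the lake.

# Minimal scikit-learn sketch; the data set is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                    # two toy features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)    # labels from a simple rule

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))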