The document discusses Oracle's cloud-based data lake and analytics platform. It provides an overview of the key technologies and services available, including Spark, Kafka, Hive, object storage, notebooks and data visualization tools. It then outlines a scenario for setting up storage and big data services in Oracle Cloud to create a new data lake for batch, real-time and external data sources. The goal is to provide an agile and scalable environment for data scientists, developers and business users.
This document provides an overview and introduction to big data concepts presented by Paresh Motiwala for the New England SQL Server group. The presentation covers sources of big data, privacy concerns, storing data in Hadoop, processing data with MapReduce, and using tools like R, Python, and Power BI for analysis and visualization. It defines key big data concepts like the 5 V's and discusses challenges like volume, variety and velocity of data. It also summarizes strategies for companies to develop their data and cloud capabilities.
1. The document discusses Big Data analytics using Hadoop. It defines Big Data and explains the 3Vs of Big Data - volume, velocity, and variety.
2. It then describes Hadoop, an open-source framework for distributed storage and processing of large data sets across clusters of commodity hardware. Hadoop uses HDFS for storage and MapReduce for distributed processing.
3. The core components of Hadoop are the NameNode, which manages file system metadata, and DataNodes, which store data blocks. The document also explains HDFS write and read operations (a toy sketch of this split follows below).
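To make the NameNode/DataNode split concrete, here is a minimal Python sketch of the metadata a NameNode keeps and how a file's blocks might be assigned to DataNodes. The class name, round-robin placement, and node names are illustrative assumptions, not Hadoop's actual implementation (real HDFS placement is rack-aware).

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)
REPLICATION = 3                  # HDFS default replication factor

class ToyNameNode:
    """Illustrative model: holds file-system metadata only, never the data itself."""
    def __init__(self, datanodes):
        self.datanodes = datanodes   # e.g. ["dn1", "dn2", "dn3", "dn4"]
        self.block_map = {}          # path -> list of (block_id, replica locations)
        self._ids = itertools.count()

    def allocate(self, path, file_size):
        """Write path: split the file into blocks, pick DataNodes for each replica."""
        n_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division
        blocks = []
        for i in range(n_blocks):
            replicas = [self.datanodes[(i + r) % len(self.datanodes)]
                        for r in range(REPLICATION)]   # naive round-robin placement
            blocks.append((next(self._ids), replicas))
        self.block_map[path] = blocks
        return blocks   # the client then streams block data to these DataNodes

    def locate(self, path):
        """Read path: a client first asks the NameNode where each block lives."""
        return self.block_map[path]

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.allocate("/logs/day1.txt", 300 * 1024 * 1024)   # 300 MB -> 3 blocks
print(nn.locate("/logs/day1.txt"))
```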
Introducing the Big Data Ecosystem with Caserta Concepts & Talend (by Caserta)
This document summarizes a webinar presented by Talend and Caserta Concepts on the big data ecosystem. The webinar discussed how Talend provides an open source integration platform that scales to handle large data volumes and complex processes. It also overviewed Caserta Concepts' expertise in data management, big data analytics, and industries like financial services. The webinar covered topics like traditional vs big data, Hadoop and NoSQL technologies, and common integration patterns between traditional data warehouses and big data platforms.
This document provides an introduction to data lakes and discusses key aspects of creating a successful data lake. It defines different stages of data lake maturity from data puddles to data ponds to data lakes to data oceans. It identifies three key prerequisites for a successful data lake: having the right platform (such as Hadoop) that can handle large volumes and varieties of data inexpensively, obtaining the right data such as raw operational data from across the organization, and providing the right interfaces for business users to access and analyze data without IT assistance.
MapReduce allows distributed processing of large datasets across clusters of computers. It works by splitting the input data into independent chunks, which are processed by the map function in parallel. The map function produces intermediate key-value pairs, which are grouped by key and combined by the reduce function to form the output data. Fault tolerance is achieved through replication of data across nodes and re-execution of failed tasks. This makes MapReduce suitable for efficiently processing very large datasets in a distributed environment.
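The split/map/shuffle/reduce flow just described fits in a few lines of single-process Python; the canonical word-count example below stands in for the distributed version, with the shuffle step modeled as a plain group-by-key (in Hadoop the framework performs it across the network).

```python
from collections import defaultdict

def map_phase(chunk):
    # map: emit intermediate (key, value) pairs for one input split
    return [(word, 1) for word in chunk.split()]

def shuffle(pairs):
    # shuffle: group intermediate values by key (done by the framework in Hadoop)
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # reduce: fold each key's values into one output record
    return key, sum(values)

chunks = ["big data big clusters", "data lakes and data ponds"]   # two input splits
intermediate = [pair for c in chunks for pair in map_phase(c)]    # maps run in parallel
result = dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())
print(result)   # {'big': 2, 'data': 3, 'clusters': 1, 'lakes': 1, 'and': 1, 'ponds': 1}
```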
A high-level overview of common Cassandra use cases, adoption reasons, big data trends, DataStax Enterprise, and the future of big data, given at the 7th Advanced Computing Conference in Seoul, South Korea.
Big data and data warehousing can work in synergy by applying the structure of data warehousing to the large, unstructured datasets of big data. While data warehousing focuses on modeling data, co-locating related information, and optimizing queries, big data is better suited to analyzing unstructured data at scale through distributed systems without an upfront model. The two approaches complement each other: data warehousing brings structure to big data through modeling, while big data contributes the ability to analyze unstructured data at massive scale.
Big Data Analysis Patterns - TriHUG 6/27/2013 (by boorad)
Big Data Analysis Patterns: Tying real world use cases to strategies for analysis using big data technologies and tools.
Big data is ushering in a new era for analytics with large scale data and relatively simple algorithms driving results rather than relying on complex models that use sample data. When you are ready to extract benefits from your data, how do you decide what approach, what algorithm, what tool to use? The answer is simpler than you think.
This session tackles big data analysis with a practical description of strategies for several classes of application types, identified concretely with use cases. Topics include new approaches to search and recommendation using scalable technologies such as Hadoop, Mahout, Storm, Solr, & Titan.
IBM InfoSphere BigInsights provides Hadoop on cloud computing platforms so that users can analyze large volumes of data without large upfront investments in hardware, storage, and networking. It allows users to deploy their own Hadoop clusters on public clouds like Amazon or on private clouds in under 30 minutes, paying only for the resources used, starting at $0.34 per node per hour. BigInsights can be deployed on IBM SmartCloud Enterprise with charges starting at $0.30 per cluster per hour, with a free trial during a fall promotion. It makes evaluating and learning Hadoop easy without the need to configure hardware or install software.
This document provides an overview of big data. It defines big data as large volumes of diverse data that are growing rapidly and require new techniques to capture, store, distribute, manage, and analyze. The key characteristics of big data are volume, velocity, and variety. Common sources of big data include sensors, mobile devices, social media, and business transactions. Tools like Hadoop and MapReduce are used to store and process big data across distributed systems. Applications of big data include smarter healthcare, traffic control, and personalized marketing. The future of big data is promising with the market expected to grow substantially in the coming years.
Apache Hadoop and the Big Data Opportunity in Banking
The document discusses Apache Hadoop and how it can help banks leverage big data opportunities. It provides an overview of what Apache Hadoop is, how it works, and the core projects. It then discusses how Hadoop can help banks create value by detecting fraud, managing risk, improving products based on customer data analysis, and more. The presenters are from Hortonworks, the lead commercial company for Hadoop, and Tresata, a company focused on using Hadoop for banking applications.
Big data analytics is the use of advanced analytic techniques against very large, diverse data sets that include different types, such as structured/unstructured and streaming/batch, and different sizes, from terabytes to zettabytes. Big data is a term applied to data sets whose size or type is beyond the ability of traditional relational databases to capture, manage, and process with low latency. It has one or more of the following characteristics: high volume, high velocity, or high variety. Big data comes from sensors, devices, video/audio, networks, log files, transactional applications, the web, and social media, much of it generated in real time and at very large scale.
Analyzing big data allows analysts, researchers, and business users to make better and faster decisions using data that was previously inaccessible or unusable. Using advanced analytics techniques such as text analytics, machine learning, predictive analytics, data mining, statistics, and natural language processing, businesses can analyze previously untapped data sources independently of or together with their existing enterprise data to gain new insights.
Big data refers to the large volume of structured and unstructured data that businesses receive daily. It has three key aspects: volume, variety, and velocity. Big data matters for making good use of data, benefiting from cloud storage capabilities, enabling data visualization, and finding new business opportunities. It can lead to improved efficiency, sales, costs, customer service, and products/services. Common software used for big data includes Hadoop, Pig, and Hive. Challenges include finding useful information, data silos, inaccurate data, and a lack of skilled workers. Potential application areas are analytics, integration, cleaning, and updating data. The future of big data involves addressing privacy, security, latency, and scaling issues.
Big data - what, why, where, when and how (by bobosenthil)
The document discusses big data, including what it is, its characteristics, and architectural frameworks for managing it. Big data is defined as data that exceeds the processing capacity of conventional database systems due to its large size, speed of creation, and unstructured nature. The architecture for managing big data is demonstrated through Hadoop technology, which uses a MapReduce framework and open source ecosystem to process data across multiple nodes in parallel.
Introduction to Data Mining, Business Intelligence and Data Science (by IMC Institute)
This document discusses data mining, business intelligence, and data science. It begins with an introduction to data mining, defining it as the application of algorithms to extract patterns from data. Business intelligence is defined as the applications, infrastructure, tools, and practices that enable access to and analysis of information to improve decisions and performance. Data science is related to data mining, analytics, and machine learning, and uses techniques from statistics and computer science to discover patterns in large datasets. The document provides examples of how data is used in areas like understanding customers, healthcare, sports, and financial trading.
The document discusses big data analysis and provides an introduction to key concepts. It is divided into three parts: Part 1 introduces big data and Hadoop, the open-source software framework for storing and processing large datasets. Part 2 provides a very quick introduction to understanding data and analyzing data, intended for those new to the topic. Part 3 discusses concepts and references to use cases for big data analysis in the airline industry, intended for more advanced readers. The document aims to familiarize business and management users with big data analysis terms and thinking processes for formulating analytical questions to address business problems.
This presentation introduces concepts of Big Data in layman's language. The author does not claim originality of the content; the presentation was compiled from various sources, and the author claims no copyright over the material.
Big data is growing exponentially in today's age of information and digital shrinkage. This presentation clarifies the concept and the hype revolving around it.
Big Data Real Time Analytics - A Facebook Case Study (by Nati Shalom)
Building Your Own Facebook Real Time Analytics System with Cassandra and GigaSpaces.
Facebook's real time analytics system is a good reference for those looking to build their real time analytics system for big data.
The first part covers the lessons from Facebook's experience and the reason they chose HBase over Cassandra.
In the second part of the session, we learn how to build our own real-time analytics system, achieve better performance, gain real business insights and analytics from our big data, and make deployment and scaling significantly simpler using the new versions of Cassandra and GigaSpaces Cloudify.
Big data refers to large volumes of diverse data that traditional database systems cannot effectively handle. With the rise of technologies like social media, sensors, and mobile devices, huge amounts of unstructured data are being generated every day. To gain insights from this "big data", alternative processing methods are needed. Hadoop is an open-source platform that can distribute data storage and processing across many servers to handle large datasets. Facebook uses Hadoop to store over 100 petabytes of user data and gain insights through analysis to improve user experience and target advertising. Organizations must prepare infrastructure like Hadoop to capture value from the growing "data tsunami" and enhance their business with big data analytics.
All about Big Data components and the best tools to ingest, process, store and visualize the data.
This is a keynote from the series "by Developer for Developers" powered by eSolutionsGrup.
The document discusses big data analytics and related topics. It provides definitions of big data, describes the increasing volume, velocity and variety of data. It also discusses challenges in data representation, storage, analytical mechanisms and other aspects of working with large datasets. Approaches for extracting value from big data are examined, along with applications in various domains.
Class lecture by Prof. Raj Jain on Big Data. The talk covers: Why Big Data Now?, Big Data Applications, ACID Requirements, Terminology, Google File System, BigTable, MapReduce, MapReduce Optimization, the Story of Hadoop, Hadoop, Apache Hadoop Tools, Other Apache Big Data Tools, Other Big Data Tools, Analytics, Types of Databases, Relational Databases and SQL, Non-relational Databases, NewSQL Databases, and Columnar Databases. A video recording is available on YouTube.
Big data comes from a variety of sources like social networks, sensors, and financial transactions. It is characterized by its volume, velocity, and variety. Hadoop and NoSQL platforms are commonly used to process and analyze big data. There are many opportunities for applications in domains like healthcare, retail, and finance. However, addressing the skills gap for data scientists remains a key challenge for fully realizing the potential of big data.
The document discusses big data, its history, technologies, and uses. It begins with an introduction to big data and defines it using the 3Vs/4Vs model, describing the volume, velocity, variety and increasingly veracity of data. It then discusses big data technologies like Hadoop, databases, reporting, dashboards and real-time analytics. Examples are given of how big data is used, such as understanding customers, optimizing business processes, improving health outcomes, and improving security and law enforcement. Requirements for big data analytics are also mentioned, including data management, analytics applications, and business interpretation.
IBM is helping companies leverage big data through its IBM big data platform and supercomputing capabilities. The document discusses how Vestas Wind Systems uses IBM's solution to analyze weather data and deliver turbine location siting data in minutes instead of weeks, as its data grows from 2.8 petabytes toward 24 petabytes. It also mentions how other customers, such as x+1, KTH Royal Institute of Technology, and the University of Ontario Institute of Technology, are achieving growth, reducing traffic times, and improving patient outcomes, respectively, through big data analytics. The VP of IBM business development hopes readers will consider IBM for their big data challenges.
This document discusses high performance analytics and summarizes key capabilities of SAS Visual Analytics including easy analytics, visualizations for any skill level, calculated measures, automatic forecasting, and saved report packages. It also provides examples of public data sources that can be analyzed in SAS Visual Analytics including agricultural production and pricing data from India.
The document discusses the big data ecosystem, including the 3Vs of volume, velocity, and variety that define big data. It describes the data path from collection through processing to query, visualization, and analysis. For processing, it discusses batch and real-time paradigms using technologies like MapReduce. It also briefly touches on infrastructure, application, and data quality monitoring in big data systems. The document provides a high-level overview of the big data landscape and considerations for collecting, processing, and consuming data appropriately for business needs.
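To contrast the two processing paradigms mentioned here, the sketch below aggregates the same event stream twice in plain Python: once as a batch total over the bounded dataset, and once in 60-second tumbling windows, the way a real-time pipeline would. The click events are invented for illustration.

```python
from collections import Counter

# Invented (timestamp_seconds, page) click events
events = [(5, "/home"), (42, "/cart"), (61, "/home"), (70, "/home"), (130, "/cart")]

# Batch paradigm: one pass over the complete, bounded dataset
batch_counts = Counter(page for _, page in events)

# Real-time paradigm: assign each event to a 60-second tumbling window on arrival
windows = {}
for ts, page in events:
    window_start = (ts // 60) * 60
    windows.setdefault(window_start, Counter())[page] += 1

print(batch_counts)   # totals over all time
print(windows)        # per-window counts: {0: ..., 60: ..., 120: ...}
```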
This document discusses using graphs for data analysis and provides examples of different types of graph algorithms and queries that can be performed on graph data. Some key points:
- Graphs can be used to represent relational datasets and enable types of analysis not possible on traditional relational models.
- Common graph algorithms discussed include centrality measures, community detection, pattern matching queries, and shortest path algorithms (a short sketch follows this list).
- Example applications highlighted are fraud detection using financial transaction graphs and topic modeling on text data represented as a graph.
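A small sketch with the networkx library shows the centrality, community detection, and shortest-path queries listed above on an invented transaction graph; the account names and edges are made-up illustrations, not a real dataset.

```python
import networkx as nx
from networkx.algorithms import community

# Invented transaction graph: accounts as nodes, money transfers as edges
G = nx.Graph()
G.add_edges_from([
    ("alice", "bob"), ("bob", "carol"), ("carol", "dave"),
    ("alice", "carol"), ("dave", "mule_account"), ("eve", "mule_account"),
])

# Centrality: which accounts sit on the most connections?
print(nx.degree_centrality(G))

# Shortest path: how are two suspicious accounts linked?
print(nx.shortest_path(G, "alice", "mule_account"))

# Community detection: groups of densely connected accounts
print(list(community.greedy_modularity_communities(G)))
```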
Expand a Data Warehouse with Hadoop and Big Data (by jdijcks)
After investing years in the data warehouse, are you now supposed to start over? Nope. This session discusses how to leverage Hadoop and big data technologies to augment the data warehouse with new data, new capabilities and new business models.
Modern data management using Kappa and streaming architectures, including discussion by eBay's Connie Yang about the Rheos platform and the use of Oracle GoldenGate, Kafka, Flink, etc.
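As a flavor of the streaming side of such architectures, below is a minimal consumer loop using the kafka-python client. The broker address, topic name, and message format are placeholders, and this is only a sketch of the log-replay pattern, not the Rheos platform itself.

```python
import json
from kafka import KafkaConsumer   # pip install kafka-python

# Placeholder broker and topic; in a Kappa architecture the same immutable
# log is replayed for both batch and real-time views.
consumer = KafkaConsumer(
    "orders",                                    # hypothetical topic
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",                # reprocess from the start of the log
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for message in consumer:
    order = message.value
    # stream-processing step: derive a running view from the log
    print(message.topic, message.offset, order)
```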
Insights into Real-world Data Management Challenges (by DataWorks Summit)
Oracle began with the belief that the foundation of IT is managing information. The Oracle Cloud Platform for Big Data is a natural extension of our belief in the power of data. Oracle's Integrated Cloud is one cloud for the entire business, meeting everyone's needs: connecting people to information through tools that help you combine and aggregate data from any source.
This session will explore how organizations can transition to the cloud by delivering fully managed and elastic Hadoop and real-time streaming cloud services to build robust offerings that provide measurable value to the business. We will explore key data management trends and dive deeper into pain points we are hearing about from our customer base.
Insights into Real World Data Management Challenges (by DataWorks Summit)
Data is your most valuable business asset, and it's also your biggest challenge. This challenge and opportunity means we continually face significant roadblocks on the way to becoming a data-driven organisation. From the management of data to the bubbling open source frameworks, from limited industry skills to time and cost pressures, our challenge in data is big.
We all want and need a “fit for purpose” approach to the management of data, especially Big Data, and overcoming the ongoing challenges around the ‘3Vs’ means we get to focus on the most important V: ‘Value’. Come along and join the discussion on how Oracle Big Data Cloud provides value in the management of data and supports your move toward becoming a data-driven organisation.
Speaker
Noble Raveendran, Principal Consultant, Oracle
The document discusses how big data and analytics can transform businesses. It notes that the volume of data is growing exponentially due to increases in smartphones, sensors, and other data producing devices. It also discusses how businesses can leverage big data by capturing massive data volumes, analyzing the data, and having a unified and secure platform. The document advocates that businesses implement the four pillars of data management: mobility, in-memory technologies, cloud computing, and big data in order to reduce the gap between data production and usage.
The document discusses opportunities for enriching a data warehouse with Hadoop. It outlines challenges with ETL and analyzing large, diverse datasets. The presentation recommends integrating Hadoop and the data warehouse to create a "data reservoir" to store all potentially valuable data. Case studies show companies using this approach to gain insights from more data, improve analytics performance, and offload ETL processing to Hadoop. The document advocates developing skills and prototypes to prove the business value of big data before fully adopting Hadoop solutions.
AUSOUG - NZOUG-GroundBreakers-Jun 2019 - AI and Machine Learning (by Sandesh Rao)
Autonomous Database is one of the hottest Oracle products, and we have used machine learning for several aspects of the service. This presentation reviews the current diagnostic methodology in the Autonomous Database Cloud services and how we process this data, at a scale of several petabytes a year, to find and troubleshoot anomalies and conduct AIOps. Use cases covered include a log-anomaly timeline, in which semi-supervised machine learning techniques reduce significant volumes of logs and match them in near real time. We will cover techniques for analyzing database issues with machine learning methods such as k-means, TF-IDF, random forests, and z-scores, for example to decide whether a spike in CPU is normal or abnormal. We will also talk about RNNs with LSTM/GRU cells as applications for predicting faults before they happen. Other use cases include using convolution filters to determine maintenance windows within database workloads, determining the best times to run database backups, and building security anomaly timelines, among others. This is a production service and can be used if you have a customer SR/defect today; the service is much more extensive inside the Oracle Autonomous Database Cloud. The presentation accompanies these techniques with several examples of how to apply them; machine learning knowledge is preferred but not a prerequisite.
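One of the simplest techniques named above, the z-score test for abnormal CPU spikes, fits in a few lines. The sample readings and the 3-sigma threshold below are illustrative assumptions, not values from the actual service.

```python
import statistics

# Hypothetical CPU utilization samples (percent), one reading per minute
cpu = [22, 25, 24, 23, 26, 24, 25, 23, 24, 91, 25, 24]

mean = statistics.mean(cpu)
stdev = statistics.stdev(cpu)

# Flag readings more than 3 standard deviations from the mean as anomalous
for minute, value in enumerate(cpu):
    z = (value - mean) / stdev
    if abs(z) > 3:
        print(f"minute {minute}: cpu={value}%  z={z:.1f}  -> abnormal spike")
```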
This document is a presentation on Big Data by Oleksiy Razborshchuk from Oracle Canada. It covers Big Data concepts, Oracle's Big Data solution and its differentiators compared to DIY Hadoop clusters, and use cases and implementation examples. Key points include the value of Oracle's Big Data Appliance, which provides faster time to value and lower costs than building your own Hadoop cluster, and how Oracle provides an integrated Big Data environment and analytics platform. Examples of Big Data solutions for financial services are also presented.
This document provides an overview and agenda for a presentation on big data landscape and implementation strategies. It defines big data, describes its key characteristics of volume, velocity and variety. It outlines the big data technology landscape including data acquisition, storage, organization and analysis tools. Finally it discusses an integrated big data architecture and considerations for implementation.
Oracle CRUI Session 1: Analytics Data Lab, the power of Big Data Investiga... (by Jürgen Ambrosi)
Data is the new capital: like financial capital, it is a resource that must be managed, collected, and kept safe, but it must also be invested by organizations that want to gain competitive advantage. Data is not a new resource, but only today is it available in abundance for the first time, together with the technologies needed to maximize its return, just as electricity was a laboratory curiosity for a long time until it became available to the masses and completely changed the face of modern industry. This is why accelerating change requires an innovative approach to executing Big Data initiatives: an analytics laboratory as a catalyst for innovation (Data Lab). In this webinar on Oracle technologies, we use our usual storytelling approach based on use cases and concrete experiences.
The document provides an overview of various emerging technologies and trends that are influencing customers, including chatbots, blockchain, internet of things, and artificial intelligence. It discusses these technologies and how Oracle is addressing them through products and services like its blockchain cloud service, IoT cloud service, and intelligent bots platform.
18. Madhur Hemnani - Result Orientated Innovation with Oracle HR Analytics (by Cedar Consulting)
The document discusses Oracle's analytics cloud strategy and Oracle Analytics Cloud (OAC) platform. It covers OAC's features such as self-service report creation, data visualization capabilities, and integration with other Oracle products. The document also summarizes how customers can migrate existing on-premise analytics solutions like OBIEE, BICS, and DVCS to OAC. Finally, it provides an overview of Oracle Analytic Cloud - Essbase for flexible analytic applications and management reporting in the cloud.
Demystifying Data Warehousing as a Service - DFW (by Kent Graziano)
This document provides an overview and introduction to Snowflake's cloud data warehousing capabilities. It begins with the speaker's background and credentials. It then discusses common data challenges organizations face today around data silos, inflexibility, and complexity. The document defines what a cloud data warehouse as a service (DWaaS) is and explains how it can help address these challenges. It provides an agenda for the topics to be covered, including features of Snowflake's cloud DWaaS and how it enables use cases like data mart consolidation and integrated data analytics. The document highlights key aspects of Snowflake's architecture and technology.
Strata 2015 presentation from Oracle for Big Data - we are announcing several new big data products including GoldenGate for Big Data, Big Data Discovery, Oracle Big Data SQL and Oracle NoSQL
The document discusses Oracle's data integration products and big data solutions. It outlines five core capabilities of Oracle's data integration platform, including data availability, data movement, data transformation, data governance, and streaming data. It then describes eight core products that address real-time and streaming integration, ELT integration, data preparation, streaming analytics, dataflow ML, metadata management, data quality, and more. The document also outlines five cloud solutions for data integration including data migrations, data warehouse integration, development and test environments, high availability, and heterogeneous cloud. Finally, it discusses pragmatic big data solutions for data ingestion, transformations, governance, connectors, and streaming big data.
The document discusses Oracle's hybrid cloud solutions and deployment choices. It outlines Oracle's strategy of providing public cloud services that can be delivered within a customer's own data center (Oracle Cloud Machine) for security and compliance reasons. It also discusses Oracle's portfolio of engineered systems that can be deployed on-premises or in the public cloud to allow for flexible workload migration.
Solution Use Case Demo: The Power of Relationships in Your Big Data (by InfiniteGraph)
In this security solution demo, we have integrated Oracle NoSQL DB with InfiniteGraph to demonstrate the power of using the right tools for the solution. By integrating Oracle's key-value technology with the InfiniteGraph distributed graph database, we can create new views of existing Call Detail Record (CDR) data to enable discovery of connections, paths, and behaviors that might otherwise be missed.
Discover how to add value to your existing Big Data to increase revenues and performance!
Actionable Insights with AI - Snowflake for Data Science (by Harald Erb)
Talk @ ScaleUp 360° AI Infrastructures DACH, 2021: Data scientists spend 80% or more of their time searching for and preparing data. This talk explains Snowflake platform capabilities, such as near-unlimited data storage and instant, near-infinite compute resources, and how the platform can seamlessly integrate with and support the machine learning libraries and tools data scientists rely on.
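A minimal sketch of the pattern the talk describes: push the heavy scan down to Snowflake with the official Python connector, then train locally with the usual ML stack. The account, credentials, and table names are placeholders, and real deployments would use key-pair auth or SSO rather than a password.

```python
import snowflake.connector                 # pip install "snowflake-connector-python[pandas]"
from sklearn.linear_model import LinearRegression

# Placeholder connection details
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="***",
    warehouse="ML_WH", database="ANALYTICS", schema="PUBLIC",
)

# Let Snowflake do the scanning; fetch only the columns the model needs
cur = conn.cursor()
cur.execute("SELECT ad_spend, revenue FROM marketing_daily")   # hypothetical table
df = cur.fetch_pandas_all()                # Snowflake returns uppercase column names
conn.close()

# Train with a standard Python ML library
model = LinearRegression().fit(df[["AD_SPEND"]], df["REVENUE"])
print(model.coef_, model.intercept_)
```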
From the Data Work Out event:
Performant and scalable Data Science with Dataiku DSS and Snowflake
Managing the whole process of setting up a machine learning environment from end-to-end becomes significantly easier when using cloud-based technologies. The ability to provision infrastructure on demand (IaaS) solves the problem of manually requesting virtual machines. It also provides immediate access to compute resources whenever they are needed. But that still leaves the administrative overhead of managing the ML software and the platform to store and manage the data.
A fully managed end-to-end machine learning platform like Dataiku Data Science Studio (DSS), which enables data scientists, machine learning experts, and even business users to quickly build, train, and host machine learning models at scale, needs to access data from many different sources, including data provided by Snowflake. Storing data in Snowflake has three significant advantages: a single source of truth, a shorter data preparation cycle, and scale as you go.
The document discusses machine learning and artificial intelligence applications inside and outside of Snowflake's cloud data warehouse. It provides an overview of Snowflake and its architecture. It then discusses how machine learning can be implemented directly in the database using SQL, user-defined functions, and stored procedures. However, it notes that pure coding is not suitable for all users and that automated machine learning outside the database may be preferable to enable more business analysts and power users. It provides an example of using Amazon Forecast for time series forecasting and integrating it with Snowflake.
Delivering rapid-fire Analytics with Snowflake and Tableau (by Harald Erb)
Until recently, advancements in data warehousing and analytics were largely incremental. Small innovations in database design would herald a new data warehouse every 2-3 years, which would quickly become overwhelmed with rapidly increasing data volumes. Knowledge workers struggled to access those databases with development-intensive BI tools designed for reporting rather than exploration and sharing. Both databases and BI tools were strained in locally hosted environments that were inflexible to growth or change.
Snowflake and Tableau represent a fundamentally different approach. Snowflake's multi-cluster shared data architecture was designed for the cloud and for handling dramatically larger data volumes at blazing speed. Tableau was made to foster an interactive approach to analytics, freeing knowledge workers to use the speed of Snowflake to their greatest advantage.
Machine Learning - A Challenge for Architects (by Harald Erb)
Because of the many potential business opportunities machine learning offers, many companies are launching initiatives for data-driven innovation. They set up analytics teams, post new openings for data scientists, build up know-how internally, and ask the IT organization for an infrastructure for heavy data engineering and processing, along with an analytics toolbox. Exciting challenges await IT architects here, including collaboration with interdisciplinary teams whose members have varying levels of machine learning (ML) knowledge and different tooling needs.
Do you know what k-Means? Cluster Analyses (by Harald Erb)
Cluster analyses are now bread-and-butter analysis techniques: methods used to discover similarity structures in (large) datasets, with the goal of identifying new groups in the data. The k-means algorithm is one of the simplest and best-known unsupervised learning methods and can be used in a variety of machine learning tasks. For example, it can find abnormal data points within a large data set, or cluster text documents or customer segments. In data analysis, applying clustering methods can be a good starting point before other classification or regression methods are used.
In this talk, the k-means algorithm, including its extensions and variants, is not examined in detail; instead it serves as a placeholder for other advanced analytics methods that today form "intelligent" components in modern software solutions or can be combined with them. Two short live examples are shown: (1) identifying customer clusters with a big data discovery tool and Python (Jupyter Notebook), and (2) implementing anomaly detection directly on a real-time data stream with an Oracle stream analytics solution.
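A minimal scikit-learn version of the k-means step on invented two-dimensional data; points far from their assigned centroid are treated as candidate anomalies, mirroring the anomaly-detection use mentioned above. The segment locations and the distance threshold are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Invented data: two well-separated customer segments plus one outlier
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
    [[10.0, -3.0]],                        # anomalous point
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Distance of each point to its own centroid; large distances flag anomalies
dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
print("suspected anomalies:", np.where(dist > 3)[0])   # threshold chosen by eye
```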
Exploratory Analysis in the Data Lab - Team-Sport or for Nerds only? (by Harald Erb)
Talk held at the DOAG 2016 conference (2016.doag.org/de/home) discussing a data lab concept, including an architecture blueprint, collaboration, and tool examples based on Oracle solutions like Oracle Big Data Discovery (in combination with Jupyter Notebook).
Big Data Discovery + Analytics = Data-driven Innovation! (by Harald Erb)
Talk from the DOAG 2015 conference: The execution of data projects does not have to be left to so-called data scientists alone. Data and tool complexity in dealing with big data are no longer insurmountable hurdles for the teams that already build and maintain the data warehouse and manage and evolve the business intelligence platform in the company today. In an interdisciplinary team, business users and business analysts contribute their domain knowledge to the data project from the outset, alongside the technical roles.
Oracle Big Data Discovery working together with Cloudera Hadoop is the fastest way to ingest and understand data. Powerful data transformation capabilities mean that data can quickly be prepared for consumption by the extended organisation.
DOAG News 2012 - Analytical Added Value with Big Data (by Harald Erb)
For several months, "big data" has been discussed intensively but also controversially. Does this approach challenge the dominance of relational databases, at least for selected analytical problems? After an introductory overview, this article uses application cases to show where the business value of big data projects lies and how these new insights can be integrated into existing data warehouse and business intelligence projects.
Oracle Unified Information Architecture + Analytics by Example (by Harald Erb)
The talk first gives an architecture overview of the UIA components and how they interact. Using a use case, it shows how, in the "UIA Data Reservoir", current data can be kept inexpensively "as is" in a Hadoop File System (HDFS) on the one hand and refined data in an Oracle 12c Data Warehouse on the other, combined with each other, analyzed via direct access in Oracle Business Intelligence, or explored for new relationships with Endeca Information Discovery.
Endeca Web Acquisition Toolkit - Integration of Distributed Web Applications and ... (by Harald Erb)
The only constant is change: the critical information companies need every day as a basis for decisions is subject to permanent change and is, moreover, spread across many internal and external sources. Whether in documents, e-mails, on portals and websites, etc., relevant data that can deliver valuable insights for sound business decisions is found everywhere.
Technically, this sometimes hard-to-reach information first has to be acquired from the distributed applications and data sources before the actual processing takes place in the data warehouse. As a graphical development tool, the Endeca Web Acquisition Toolkit (Endeca WAT) starts exactly at this point by enabling the creation of synthetic interfaces. For example, price data and/or customer reviews are to be acquired from a commercial website for which the site operator provides no API. The following article and talk outline how the Endeca Web Acquisition Toolkit can take on integration tasks for connecting external data sources within the current Oracle Information Management Reference Architecture.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Round table discussion of vector databases, unstructured data, ai, big data, real-time, robots and Milvus.
A lively discussion with NJ Gen AI Meetup Lead Prasad and Procure.FYI's Co-Founder.
Predictably Improve Your B2B Tech Company's Performance by Leveraging Data (by Kiwi Creative)
Harness the power of AI-backed reports, benchmarking and data analysis to predict trends and detect anomalies in your marketing efforts.
Peter Caputa, CEO at Databox, reveals how you can discover the strategies and tools to increase your growth rate (and margins!).
From metrics to track to data habits to pick up, enhance your reporting for powerful insights to improve your B2B tech company's marketing.
- - -
This is the webinar recording from the June 2024 HubSpot User Group (HUG) for B2B Technology USA.
Watch the video recording at https://youtu.be/5vjwGfPN9lw
Sign up for future HUG events at https://events.hubspot.com/b2b-technology-usa/
Learn SQL from basic queries to advanced queries (by manishkhaire30)
Dive into the world of data analysis with our comprehensive guide on mastering SQL! This presentation offers a practical approach to learning SQL, focusing on real-world applications and hands-on practice. Whether you're a beginner or looking to sharpen your skills, this guide provides the tools you need to extract, analyze, and interpret data effectively.
Key Highlights:
Foundations of SQL: Understand the basics of SQL, including data retrieval, filtering, and aggregation.
Advanced Queries: Learn to craft complex queries to uncover deep insights from your data (a runnable sketch follows this entry).
Data Trends and Patterns: Discover how to identify and interpret trends and patterns in your datasets.
Practical Examples: Follow step-by-step examples to apply SQL techniques in real-world scenarios.
Actionable Insights: Gain the skills to derive actionable insights that drive informed decision-making.
Join us on this journey to enhance your data analysis capabilities and unlock the full potential of SQL. Perfect for data enthusiasts, analysts, and anyone eager to harness the power of data!
#DataAnalysis #SQL #LearningSQL #DataInsights #DataScience #Analytics
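As a taste of the basics-to-advanced progression described above, this sketch runs a simple filter/aggregation and then a window-function query against an in-memory SQLite table; the sales data is invented, and window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES
        ('north', 120), ('north', 80), ('south', 200), ('south', 50), ('west', 90);
""")

# Foundations: retrieval, filtering, aggregation
for row in conn.execute(
    "SELECT region, SUM(amount) FROM sales WHERE amount > 60 GROUP BY region"
):
    print(row)

# Advanced: a window function ranking sales within each region
for row in conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
"""):
    print(row)
```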
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake (by Walaa Eldin Moustafa)
Dynamic policy enforcement is becoming an increasingly important topic in today's world, where data privacy and compliance are top priorities for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is implemented in only one SQL dialect, it is accessible in all engines. A toy sketch of the view-generation idea follows this entry.
#SQL #Views #Privacy #Compliance #DataLake
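The auto-generation idea can be illustrated with a toy generator that turns declarative column annotations into a masking SQL view. The annotation format, policy names, and masking rules here are invented stand-ins, not ViewShift's actual implementation.

```python
# Toy stand-in for annotation-driven view generation (not ViewShift's real format)
ANNOTATIONS = {
    "email":   "MASK",     # suppress entirely
    "country": "ALLOW",    # pass through unchanged
    "user_id": "HASH",     # pseudonymize
}

def compliance_view(table: str, columns: dict) -> str:
    """Generate a SQL view that enforces the per-column policy declaratively."""
    exprs = []
    for col, policy in columns.items():
        if policy == "ALLOW":
            exprs.append(col)
        elif policy == "HASH":
            exprs.append(f"sha2({col}, 256) AS {col}")
        else:   # MASK and any unknown policy default to suppression
            exprs.append(f"CAST(NULL AS STRING) AS {col}")
    cols = ",\n       ".join(exprs)
    return f"CREATE OR REPLACE VIEW {table}_compliant AS\nSELECT {cols}\nFROM {table};"

# The catalog would then route reads of `events` to `events_compliant`
print(compliance_view("events", ANNOTATIONS))
```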
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravaganza (by sameer shah)
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
The Ipsos - AI - Monitor 2024 Report.pdf (by Social Samosa)
According to Ipsos AI Monitor's 2024 report, 65% of Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Data and AI
Discussion on Vector Databases, Unstructured Data and AI
https://www.meetup.com/unstructured-data-meetup-new-york/
This meetup is for people working in unstructured data. Speakers present related topics such as vector databases, LLMs, and managing data at scale. The intended audience includes roles like machine learning engineers, data scientists, data engineers, software engineers, and PMs. This meetup was formerly the Milvus Meetup and is sponsored by Zilliz, maintainers of Milvus.
Global Situational Awareness of A.I. and where it's headed (by vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
The Building Blocks of QuestDB, a Time Series Databasejavier ramirez
Talk Delivered at Valencia Codes Meetup 2024-06.
Traditionally, databases have treated timestamps just as another data type. However, when performing real-time analytics, timestamps should be first class citizens and we need rich time semantics to get the most out of our data. We also need to deal with ever growing datasets while keeping performant, which is as fun as it sounds.
It is no wonder time-series databases are now more popular than ever before. Join me in this session to learn about the internal architecture and building blocks of QuestDB, an open source time-series database designed for speed. We will also review some of the changes we have made over the past two years to deal with late and unordered data, non-blocking writes, read replicas, and faster batch ingestion.
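The late-and-unordered-data problem mentioned above can be shown with a generic toy ingester that keeps a sorted in-memory tail so out-of-order rows can slot in, committing anything older than a lag watermark as an ordered batch. This illustrates the problem, not QuestDB's actual mechanism.

```python
import bisect

class OutOfOrderBuffer:
    """Toy ingester: keep a sorted in-memory tail so late rows can slot in,
    and commit anything older than `lag_seconds` as an ordered batch."""
    def __init__(self, lag_seconds=10):
        self.lag = lag_seconds
        self.tail = []        # sorted list of (timestamp, value)
        self.committed = []   # append-only, always in timestamp order

    def ingest(self, ts, value):
        bisect.insort(self.tail, (ts, value))        # a late row lands in order
        watermark = self.tail[-1][0] - self.lag
        while self.tail and self.tail[0][0] <= watermark:
            self.committed.append(self.tail.pop(0))  # too old to be displaced: commit

buf = OutOfOrderBuffer(lag_seconds=10)
for ts, v in [(100, 1.0), (105, 2.0), (103, 9.9), (120, 3.0)]:   # 103 arrives late
    buf.ingest(ts, v)
print(buf.committed)   # [(100, 1.0), (103, 9.9), (105, 2.0)]
```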
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W... (by Social Samosa)
The Modern Marketing Reckoner (MMR) is a comprehensive resource packed with POVs from 60+ industry leaders on how AI is transforming the 4 key pillars of marketing – product, place, price and promotions.
State of Artificial Intelligence Report 2023 (by kuntobimo2016)
Artificial intelligence (AI) is a multidisciplinary field of science and engineering whose goal is to create intelligent machines.
We believe that AI will be a force multiplier on technological progress in our increasingly digital, data-driven world. This is because everything around us today, ranging from culture to consumer products, is a product of intelligence.
The State of AI Report is now in its sixth year. Consider this report as a compilation of the most interesting things we’ve seen with a goal of triggering an informed conversation about the state of AI and its implication for the future.
We consider the following key dimensions in our report:
Research: Technology breakthroughs and their capabilities.
Industry: Areas of commercial application for AI and its business impact.
Politics: Regulation of AI, its economic implications and the evolving geopolitics of AI.
Safety: Identifying and mitigating catastrophic risks that highly-capable future AI systems could pose to us.
Predictions: What we believe will happen in the next 12 months and a 2022 performance review to keep us honest.