Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

F
Open Source Big Data in OPC
Edelweiss Kammermann
Frank MunzJava One 2017
munz & more #2
© IT Convergence 2016. All rights reserved.
© IT Convergence 2016. All rights reserved.
About Me
à Computer Engineer, BI and Data Integration Specialist
à Over 20 years of Consulting and Project Management experience in Oracle
technology.
à Co-founder and Vice President of Uruguayan Oracle User Group (UYOUG)
à Director of Community of LAOUC
à Head of BI Team CMS at ITConvergence
à Writer and frequent speaker at international conferences:
à Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum
à Oracle ACE Director
© IT Convergence 2016. All rights reserved.
Uruguay
6
Dr. Frank Munz
•Founded munz & more in 2007
•17 years Oracle Middleware,
Cloud, and Distributed Computing
•Consulting and
High-End Training
•Wrote two Oracle WLS and
one Cloud book
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka
#1
Hadoop
© IT Convergence 2016. All rights reserved.
What is Big Data?
à Volume: The high amount of data
à Variety: The wide range of different data formats and schemas.
Unstructured and semi-structured data
à Velocity: The speed which data is created or consumed
à Oracle added another V in this definition
à Value: Data has intrinsic value—but it must be discovered.
© IT Convergence 2016. All rights reserved.
What is Oracle Big Data Cloud Compute Edition?
à Big Data Platform that integrates Oracle Big Data solution with
Open Source tools
à Fully Elastic
à Integrated with Other Paas Services as Database Cloud Service, MySQL Cloud
Service, Event Hub Cloud Service
à Access, Data and Network Security
à REST access to all the funcitonality
© IT Convergence 2016. All rights reserved.
Big Data Cloud Service – Compute Edition (BDCS-CE)
© IT Convergence 2016. All rights reserved.
BDCS-CE Notebook: Interactive Analysis
à Apache Zeppelin Notebook (version0.7) to interactively work with data
© IT Convergence 2016. All rights reserved.
What is Hadoop?
à An open source software platform for distributed storage and
processing
à Manage huge volumes of unstructured data
à Parallel processing of large data set
à Highly scalable
à Fault-tolerant
à Two main components:
à HDFS: Hadoop Distributed File System for storing information
à MapReduce: programming framework that process information
© IT Convergence 2016. All rights reserved.
Hadoop Components: HFDS
à Stores the data on the cluster
à Namenode: block registry
à DataNode: block containers themselves (Datanode)
à HDFS cartoon by Mvarshney
© IT Convergence 2016. All rights reserved.
Hadoop Components: MapReduce
à Retrieves data from HDFS
à A MapReduce program is composed by
à Map() method: performs filtering and sorting of the <key, value> inputs
à Reduce() method: summarize the <key,value> pairs provided by the Mappers
à Code can be written in many languages (Perl, Python, Java etc)
© IT Convergence 2016. All rights reserved.
MapReduce Example
© IT Convergence 2016. All rights reserved.
Code Example
© IT Convergence 2016. All rights reserved.
Code Example
© IT Convergence 2016. All rights reserved.
#2
Hive
© IT Convergence 2016. All rights reserved.
What is Hive?
à An open source data warehouse software on top of Apache Hadoop
à Analyze and query data stored in HDFS
à Structure the data into tables
à Tools for simple ETL
à SQL- like queries (HiveQL)
à Procedural language with HPL-SQL
à Metadata storage in a RDBMS
© IT Convergence 2016. All rights reserved.
Hadoop & Hive Demo
#3
Spark
Revisited: Map Reduce I/O
munz & more #23
Source:	Hadoop	Application	Architecture	Book
Spark
• Orders of magnitude(s) faster than M/R
• Higher level Scala, Java or Python API
• Standalone, in Hadoop, or Mesos
• Principle: Run an operation on all data
-> ”Spark is the new MapReduce”
• See also: Apache Storm, etc
• Uses RDDs, or Dataframes, or Datasets
munz & more #24
https://stackoverflow.com/questions/31508083/difference-between-
dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark
https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
RDDs
Resilient Distributed Datasets
Where do they come from?
Collection of data grouped into named columns.
Supports text, JSON, Apache Parquet, sequence.
Read	in	
HDFS,	Local	FS,	S3,	Hbase
Parallelize	
existing	Collection
Transform
other	RDD
->	RDDs	are	immutable
Lazy Evaluation
munz & more #26
Nothing	is	executed Execution
Transformations:
map(), flatMap(),
reduceByKey(), groupByKey()
Actions:
collect(), count(), first(), takeOrdered(),
saveAsTextFile(), …
http://spark.apache.org/docs/2.1.1/programming-guide.html
map(func) Return	a	new	distributed	dataset	formed	
by	passing	each	element	of	the	source	
through	a	function func.
flatMap(func) Similar	to	map,	but	each	input	item	can	be	
mapped	to	0	or	more	output	items	(so func
should	return	a	Seq rather	than	a	single	
item).
reduceByKey(func,	[numTasks]) When	called	on	a	dataset	of	(K,	V)	pairs,	
returns	a	dataset	of	(K,	V)	pairs	where	the	
values	for	each	key	are	aggregated	using	
the	given	reduce	function func,	which	must	
be	of	type	(V,V)	=>	V.	
groupByKey([numTasks]) When	called	on	a	dataset	of	(K,	V)	pairs,	
returns	a	dataset	of	(K,	Iterable<V>)	pairs.
Transformations
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka
Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka
Spark Demo
munz & more #30
Apache Zeppelin Notebook
munz & more #31
Word Count and Histogram
munz & more #32
res =
t.flatMap(lambda line: line.split(" "))
.map(lambda word: (word, 1))
.reduceByKey(lambda a, b: a + b)
res.takeOrdered(5, key = lambda x: -x[1])
Zeppelin Notebooks
munz & more #33
Big Data Compute Service CE
munz & more #34
#4
Kafka
Kafka
Partitioned, replicated commit log
munz & more #36
0 1 2 3 4 … n
Immutable	log:	Messages	with	offset
Producer
Consumer	A
Consumer	B
https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it-
is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
Broker1
Broker2
Broker3
Topic	A	
(1)
Topic	A	
(2)
Topic	A	
(3)
Partition	/
Leader
Repl A	
(1)
Repl A	
(2)
Repl A	
(3)
Producer
Replication	/
Follower
Zoo-
keeper
Zoo-
keeper
Zoo-
keeper
State	/
HA
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/
- 1 topic
- 1 partition
- Contains every article published
since 1851
- Multiple producers / consumers
Example	for	
Stream	/	Table	Duality
Kafka Clients
SDKs Connect Streams
- OOTB:	Java,	Scala
- Confluent:	Python,	C,	
C++
Confluent:
- HDFS	sink,	
- JDBC	source,
- S3	sink
- Elastic	search	sink
- Plugin	.jar	file
- JDBC:	Change	data	
capture	(CDC)
- Real-time	data	ingestion
- Microservices
- KSQL:	SQL	streaming	
engine	for	streaming	
ETL,	anomaly	detection,	
monitoring
- .jar	file	runs	anywhere
High	/	low	level	Kafka	API Configuration	only
Integrate	external	Systems
Data	in	Motion
Stream	/	Table	duality	
REST
- Language	
agnostic
- Easy	for	
mobile	apps
- Easy	to	
tunnel	
through	FW	
etc.
Lightweight
Oracle Event Hub Cloud Service
• PaaS: Managed Kafka 0.10.2
• Two deployment modes
– Basic (Broker and ZK on 1 node)
– Recommended (distributed)
• REST Proxy
– Separate sever(s) running REST Proxy
munz & more #40
Event Hub
munz & more #41
Event Hub Service
munz & more #42
Ports
You must open ports to allow access for
external clients
• Kafka Broker (from OPC connect string)
• Zookeeper with port 2181
munz & more #43
Scaling
munz & more #44
horizontal (up)vertical
Event Hub REST Interface
munz & more #45
https://129.151.91.31:1080/restproxy/topics/a12345orderTopic
Service = Topic
Interesting to Know
• Event Hub topics are prefixed with ID domain
• With Kafka CLI topics with ID Domain can be
created
• Topics without ID domain are not shown in
OPC console
46
#5
Conclusion
TL;DR #bigData #openSource #OPC
OpenSource: entry point to Oracle Big
Data world / Low(er) setup times /
Check for resource usage & limits in
Big Data OPC / BDCS-CE: managed
Hadoop, Hive, Spark + Event hub:
Kafka / Attend a hands-on workshop! /
Next level: Oracle Big Data tools
@EdelweissK
@FrankMunz
www.linkedin.com/in/frankmunz/
www.munzandmore.com/blog
facebook.com/cloudcomputingbook
facebook.com/weblogicbook
@frankmunz
youtube.com/weblogicbook
-> more than 50 web casts
Don’t be
shy J
email:	ekammermann@itconvergence.com
Twitter:	@EdelweissK
3	Membership	Tiers
• Oracle	ACE	Director
• Oracle	ACE
• Oracle	ACE	Associate
bit.ly/OracleACEProgram
500+	Technical	Experts	
Helping	Peers	Globally
Connect:
Nominate	yourself	or	someone	you	know:	acenomination.oracle.com
@oracleace
Facebook.com/oracleaces
oracle-ace_ww@oracle.com
Sign up for Free Trial
http://cloud.oracle.com
1 of 52

Recommended

Oracle CODE 2017 San Francisco: Docker on Raspi Swarm to OCCS by
Oracle CODE 2017 San Francisco: Docker on Raspi Swarm to OCCSOracle CODE 2017 San Francisco: Docker on Raspi Swarm to OCCS
Oracle CODE 2017 San Francisco: Docker on Raspi Swarm to OCCSFrank Munz
2.9K views54 slides
Serverless / FaaS / Lambda and how it relates to Microservices by
Serverless / FaaS / Lambda and how it relates to MicroservicesServerless / FaaS / Lambda and how it relates to Microservices
Serverless / FaaS / Lambda and how it relates to MicroservicesFrank Munz
2.5K views58 slides
Oracle Java Cloud Service JCS (and WebLogic 12c) - What you Should Know by
Oracle Java Cloud Service JCS (and WebLogic 12c) - What you Should KnowOracle Java Cloud Service JCS (and WebLogic 12c) - What you Should Know
Oracle Java Cloud Service JCS (and WebLogic 12c) - What you Should KnowFrank Munz
2.5K views51 slides
Serverless Presentation from Devoxx 2017 Casablanca (AWS Lambda / FaaS / Fn ... by
Serverless Presentation from Devoxx 2017 Casablanca  (AWS Lambda / FaaS / Fn ...Serverless Presentation from Devoxx 2017 Casablanca  (AWS Lambda / FaaS / Fn ...
Serverless Presentation from Devoxx 2017 Casablanca (AWS Lambda / FaaS / Fn ...Frank Munz
561 views41 slides
Grails Monolith to Microservice to FaaS by
Grails Monolith to Microservice to FaaSGrails Monolith to Microservice to FaaS
Grails Monolith to Microservice to FaaSMike Wyszinski
1.2K views27 slides
Docker, Mesos, Spark by
Docker, Mesos, Spark Docker, Mesos, Spark
Docker, Mesos, Spark Qiang Wang
908 views14 slides

More Related Content

What's hot

Wido den hollander cloud stack and ceph by
Wido den hollander   cloud stack and cephWido den hollander   cloud stack and ceph
Wido den hollander cloud stack and cephShapeBlue
1.9K views27 slides
Introduction to Apache CloudStack by David Nalley by
Introduction to Apache CloudStack by David NalleyIntroduction to Apache CloudStack by David Nalley
Introduction to Apache CloudStack by David Nalleybuildacloud
1.7K views32 slides
Big Data in Container; Hadoop Spark in Docker and Mesos by
Big Data in Container; Hadoop Spark in Docker and MesosBig Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and MesosHeiko Loewe
1.8K views42 slides
Scalable On-Demand Hadoop Clusters with Docker and Mesos by
Scalable On-Demand Hadoop Clusters with Docker and MesosScalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and Mesosnelsonadpresent
2.7K views20 slides
The Future of SDN in CloudStack by Chiradeep Vittal by
The Future of SDN in CloudStack by Chiradeep VittalThe Future of SDN in CloudStack by Chiradeep Vittal
The Future of SDN in CloudStack by Chiradeep Vittalbuildacloud
1.3K views29 slides
February 2016 HUG: Running Spark Clusters in Containers with Docker by
February 2016 HUG: Running Spark Clusters in Containers with DockerFebruary 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with DockerYahoo Developer Network
4.4K views31 slides

What's hot(20)

Wido den hollander cloud stack and ceph by ShapeBlue
Wido den hollander   cloud stack and cephWido den hollander   cloud stack and ceph
Wido den hollander cloud stack and ceph
ShapeBlue1.9K views
Introduction to Apache CloudStack by David Nalley by buildacloud
Introduction to Apache CloudStack by David NalleyIntroduction to Apache CloudStack by David Nalley
Introduction to Apache CloudStack by David Nalley
buildacloud1.7K views
Big Data in Container; Hadoop Spark in Docker and Mesos by Heiko Loewe
Big Data in Container; Hadoop Spark in Docker and MesosBig Data in Container; Hadoop Spark in Docker and Mesos
Big Data in Container; Hadoop Spark in Docker and Mesos
Heiko Loewe1.8K views
Scalable On-Demand Hadoop Clusters with Docker and Mesos by nelsonadpresent
Scalable On-Demand Hadoop Clusters with Docker and MesosScalable On-Demand Hadoop Clusters with Docker and Mesos
Scalable On-Demand Hadoop Clusters with Docker and Mesos
nelsonadpresent2.7K views
The Future of SDN in CloudStack by Chiradeep Vittal by buildacloud
The Future of SDN in CloudStack by Chiradeep VittalThe Future of SDN in CloudStack by Chiradeep Vittal
The Future of SDN in CloudStack by Chiradeep Vittal
buildacloud1.3K views
February 2016 HUG: Running Spark Clusters in Containers with Docker by Yahoo Developer Network
February 2016 HUG: Running Spark Clusters in Containers with DockerFebruary 2016 HUG: Running Spark Clusters in Containers with Docker
February 2016 HUG: Running Spark Clusters in Containers with Docker
DockerCon 2016 Ecosystem - Everything You Need to Know About Docker and Stora... by ClusterHQ
DockerCon 2016 Ecosystem - Everything You Need to Know About Docker and Stora...DockerCon 2016 Ecosystem - Everything You Need to Know About Docker and Stora...
DockerCon 2016 Ecosystem - Everything You Need to Know About Docker and Stora...
ClusterHQ813 views
OpenStack Best Practices and Considerations - terasky tech day by Arthur Berezin
OpenStack Best Practices and Considerations  - terasky tech dayOpenStack Best Practices and Considerations  - terasky tech day
OpenStack Best Practices and Considerations - terasky tech day
Arthur Berezin14.4K views
Cloud OS development by Sean Chang
Cloud OS developmentCloud OS development
Cloud OS development
Sean Chang2.9K views
Introducing Node.js in an Oracle technology environment (including hands-on) by Lucas Jellema
Introducing Node.js in an Oracle technology environment (including hands-on)Introducing Node.js in an Oracle technology environment (including hands-on)
Introducing Node.js in an Oracle technology environment (including hands-on)
Lucas Jellema2.4K views
Lessons Learned from Dockerizing Spark Workloads by BlueData, Inc.
Lessons Learned from Dockerizing Spark WorkloadsLessons Learned from Dockerizing Spark Workloads
Lessons Learned from Dockerizing Spark Workloads
BlueData, Inc. 1.1K views
Build public private cloud using openstack by Framgia Vietnam
Build public private cloud using openstackBuild public private cloud using openstack
Build public private cloud using openstack
Framgia Vietnam9.4K views
Jenkins, jclouds, CloudStack, and CentOS by David Nalley by buildacloud
Jenkins, jclouds, CloudStack, and CentOS by David NalleyJenkins, jclouds, CloudStack, and CentOS by David Nalley
Jenkins, jclouds, CloudStack, and CentOS by David Nalley
buildacloud892 views
Cloud stack design camp on jun 15 by Isaac Chiang
Cloud stack design camp on jun 15Cloud stack design camp on jun 15
Cloud stack design camp on jun 15
Isaac Chiang3.2K views
Docker based Hadoop provisioning - Hadoop Summit 2014 by Janos Matyas
Docker based Hadoop provisioning - Hadoop Summit 2014 Docker based Hadoop provisioning - Hadoop Summit 2014
Docker based Hadoop provisioning - Hadoop Summit 2014
Janos Matyas20.5K views

Similar to Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners by
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersThe Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersEdelweiss Kammermann
122 views57 slides
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno... by
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...
Accelerating Apache Hadoop through High-Performance Networking and I/O Techno...DataWorks Summit/Hadoop Summit
1.4K views49 slides
The other Apache Technologies your Big Data solution needs by
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needsgagravarr
983 views59 slides
Hadoop 3 in a Nutshell by
Hadoop 3 in a NutshellHadoop 3 in a Nutshell
Hadoop 3 in a NutshellDataWorks Summit/Hadoop Summit
7.4K views36 slides
Tools and techniques for data science by
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data scienceAjay Ohri
10.7K views98 slides
Overview of big data & hadoop v1 by
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1Thanh Nguyen
1K views50 slides

Similar to Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka(20)

The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners by Edelweiss Kammermann
The Open Source and Cloud Part of Oracle Big Data Cloud Service for BeginnersThe Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The Open Source and Cloud Part of Oracle Big Data Cloud Service for Beginners
The other Apache Technologies your Big Data solution needs by gagravarr
The other Apache Technologies your Big Data solution needsThe other Apache Technologies your Big Data solution needs
The other Apache Technologies your Big Data solution needs
gagravarr983 views
Tools and techniques for data science by Ajay Ohri
Tools and techniques for data scienceTools and techniques for data science
Tools and techniques for data science
Ajay Ohri10.7K views
Overview of big data & hadoop v1 by Thanh Nguyen
Overview of big data & hadoop   v1Overview of big data & hadoop   v1
Overview of big data & hadoop v1
Thanh Nguyen1K views
Agile data lake? An oxymoron? by samthemonad
Agile data lake? An oxymoron?Agile data lake? An oxymoron?
Agile data lake? An oxymoron?
samthemonad287 views
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016 by MLconf
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
Arun Rathinasabapathy, Senior Software Engineer, LexisNexis at MLconf ATL 2016
MLconf851 views
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr... by inside-BigData.com
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
Big Data Meets HPC - Exploiting HPC Technologies for Accelerating Big Data Pr...
inside-BigData.com975 views
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos by Lester Martin
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Lester Martin3.7K views
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc... by Data Con LA
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Stream your Operational Data with Apache Spark & Kafka into Hadoop using Couc...
Data Con LA2.4K views
Big Data Hoopla Simplified - TDWI Memphis 2014 by Rajan Kanitkar
Big Data Hoopla Simplified - TDWI Memphis 2014Big Data Hoopla Simplified - TDWI Memphis 2014
Big Data Hoopla Simplified - TDWI Memphis 2014
Rajan Kanitkar1.3K views
Attunity Hortonworks Webinar- Sept 22, 2016 by Hortonworks
Attunity Hortonworks Webinar- Sept 22, 2016Attunity Hortonworks Webinar- Sept 22, 2016
Attunity Hortonworks Webinar- Sept 22, 2016
Hortonworks1.2K views
Big data or big deal by eduarderwee
Big data or big dealBig data or big deal
Big data or big deal
eduarderwee846 views
SQL Engines for Hadoop - The case for Impala by markgrover
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover1.2K views
Eric Baldeschwieler Keynote from Storage Developers Conference by Hortonworks
Eric Baldeschwieler Keynote from Storage Developers ConferenceEric Baldeschwieler Keynote from Storage Developers Conference
Eric Baldeschwieler Keynote from Storage Developers Conference
Hortonworks2.2K views
Present and future of unified, portable, and efficient data processing with A... by DataWorks Summit
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
DataWorks Summit319 views
Accelerating Hadoop, Spark, and Memcached with HPC Technologies by inside-BigData.com
Accelerating Hadoop, Spark, and Memcached with HPC TechnologiesAccelerating Hadoop, Spark, and Memcached with HPC Technologies
Accelerating Hadoop, Spark, and Memcached with HPC Technologies
inside-BigData.com1.5K views
Impala turbocharge your big data access by Ophir Cohen
Impala   turbocharge your big data accessImpala   turbocharge your big data access
Impala turbocharge your big data access
Ophir Cohen594 views

More from Frank Munz

Microservices Runtimes by
Microservices RuntimesMicroservices Runtimes
Microservices RuntimesFrank Munz
1.1K views51 slides
From Docker Swarm to OCCS and Wercker: Live-hacking at Oracle CODE Mexico 2017 by
From Docker Swarm to OCCS and Wercker: Live-hacking at Oracle CODE Mexico 2017From Docker Swarm to OCCS and Wercker: Live-hacking at Oracle CODE Mexico 2017
From Docker Swarm to OCCS and Wercker: Live-hacking at Oracle CODE Mexico 2017Frank Munz
502 views55 slides
Docker from A to Z, including Swarm and OCCS by
Docker from A to Z, including Swarm and OCCSDocker from A to Z, including Swarm and OCCS
Docker from A to Z, including Swarm and OCCSFrank Munz
1.5K views88 slides
Oracle Service Bus 12c (12.2.1) What You Always Wanted to Know by
Oracle Service Bus 12c (12.2.1) What You Always Wanted to KnowOracle Service Bus 12c (12.2.1) What You Always Wanted to Know
Oracle Service Bus 12c (12.2.1) What You Always Wanted to KnowFrank Munz
6.2K views67 slides
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2... by
What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...Frank Munz
21.2K views73 slides
Docker in the Oracle Universe / WebLogic 12c / OFM 12c by
Docker in the Oracle Universe / WebLogic 12c / OFM 12cDocker in the Oracle Universe / WebLogic 12c / OFM 12c
Docker in the Oracle Universe / WebLogic 12c / OFM 12cFrank Munz
8.9K views70 slides

More from Frank Munz(9)

Microservices Runtimes by Frank Munz
Microservices RuntimesMicroservices Runtimes
Microservices Runtimes
Frank Munz1.1K views
From Docker Swarm to OCCS and Wercker: Live-hacking at Oracle CODE Mexico 2017 by Frank Munz
From Docker Swarm to OCCS and Wercker: Live-hacking at Oracle CODE Mexico 2017From Docker Swarm to OCCS and Wercker: Live-hacking at Oracle CODE Mexico 2017
From Docker Swarm to OCCS and Wercker: Live-hacking at Oracle CODE Mexico 2017
Frank Munz502 views
Docker from A to Z, including Swarm and OCCS by Frank Munz
Docker from A to Z, including Swarm and OCCSDocker from A to Z, including Swarm and OCCS
Docker from A to Z, including Swarm and OCCS
Frank Munz1.5K views
Oracle Service Bus 12c (12.2.1) What You Always Wanted to Know by Frank Munz
Oracle Service Bus 12c (12.2.1) What You Always Wanted to KnowOracle Service Bus 12c (12.2.1) What You Always Wanted to Know
Oracle Service Bus 12c (12.2.1) What You Always Wanted to Know
Frank Munz6.2K views
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2... by Frank Munz
What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...What You Should Know About WebLogic Server 12c (12.2.1.2)  #oow2015 #otntour2...
What You Should Know About WebLogic Server 12c (12.2.1.2) #oow2015 #otntour2...
Frank Munz21.2K views
Docker in the Oracle Universe / WebLogic 12c / OFM 12c by Frank Munz
Docker in the Oracle Universe / WebLogic 12c / OFM 12cDocker in the Oracle Universe / WebLogic 12c / OFM 12c
Docker in the Oracle Universe / WebLogic 12c / OFM 12c
Frank Munz8.9K views
12 Things About WebLogic 12.1.3 #oow2014 #otnla15 by Frank Munz
12 Things About WebLogic 12.1.3 #oow2014 #otnla1512 Things About WebLogic 12.1.3 #oow2014 #otnla15
12 Things About WebLogic 12.1.3 #oow2014 #otnla15
Frank Munz15K views
WebLogic JMX for DevOps by Frank Munz
WebLogic JMX for DevOpsWebLogic JMX for DevOps
WebLogic JMX for DevOps
Frank Munz3.4K views
Oracle Service Bus (OSB) for the Busy IT Professonial by Frank Munz
Oracle Service Bus (OSB) for the Busy IT Professonial Oracle Service Bus (OSB) for the Busy IT Professonial
Oracle Service Bus (OSB) for the Busy IT Professonial
Frank Munz7.1K views

Recently uploaded

Building trust in our information ecosystem: who do we trust in an emergency by
Building trust in our information ecosystem: who do we trust in an emergencyBuilding trust in our information ecosystem: who do we trust in an emergency
Building trust in our information ecosystem: who do we trust in an emergencyTina Purnat
85 views18 slides
Audience profile.pptx by
Audience profile.pptxAudience profile.pptx
Audience profile.pptxMollyBrown86
12 views2 slides
UiPath Document Understanding_Day 3.pptx by
UiPath Document Understanding_Day 3.pptxUiPath Document Understanding_Day 3.pptx
UiPath Document Understanding_Day 3.pptxUiPathCommunity
95 views25 slides
𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲 by
𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲
𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲Infosec train
7 views6 slides
WEB 2.O TOOLS: Empowering education.pptx by
WEB 2.O TOOLS: Empowering education.pptxWEB 2.O TOOLS: Empowering education.pptx
WEB 2.O TOOLS: Empowering education.pptxnarmadhamanohar21
15 views16 slides
PORTFOLIO 1 (Bret Michael Pepito).pdf by
PORTFOLIO 1 (Bret Michael Pepito).pdfPORTFOLIO 1 (Bret Michael Pepito).pdf
PORTFOLIO 1 (Bret Michael Pepito).pdfbrejess0410
7 views6 slides

Recently uploaded(20)

Building trust in our information ecosystem: who do we trust in an emergency by Tina Purnat
Building trust in our information ecosystem: who do we trust in an emergencyBuilding trust in our information ecosystem: who do we trust in an emergency
Building trust in our information ecosystem: who do we trust in an emergency
Tina Purnat85 views
UiPath Document Understanding_Day 3.pptx by UiPathCommunity
UiPath Document Understanding_Day 3.pptxUiPath Document Understanding_Day 3.pptx
UiPath Document Understanding_Day 3.pptx
UiPathCommunity95 views
𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲 by Infosec train
𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲
𝐒𝐨𝐥𝐚𝐫𝐖𝐢𝐧𝐝𝐬 𝐂𝐚𝐬𝐞 𝐒𝐭𝐮𝐝𝐲
Infosec train7 views
PORTFOLIO 1 (Bret Michael Pepito).pdf by brejess0410
PORTFOLIO 1 (Bret Michael Pepito).pdfPORTFOLIO 1 (Bret Michael Pepito).pdf
PORTFOLIO 1 (Bret Michael Pepito).pdf
brejess04107 views
Serverless cloud architecture patterns by Jimmy Dahlqvist
Serverless cloud architecture patternsServerless cloud architecture patterns
Serverless cloud architecture patterns
Jimmy Dahlqvist17 views
google forms survey (1).pptx by MollyBrown86
google forms survey (1).pptxgoogle forms survey (1).pptx
google forms survey (1).pptx
MollyBrown8614 views
Opportunities for Youth in IG - Alena Muravska RIPE NCC.pdf by RIPE NCC
Opportunities for Youth in IG - Alena Muravska RIPE NCC.pdfOpportunities for Youth in IG - Alena Muravska RIPE NCC.pdf
Opportunities for Youth in IG - Alena Muravska RIPE NCC.pdf
RIPE NCC9 views
Existing documentaries (1).docx by MollyBrown86
Existing documentaries (1).docxExisting documentaries (1).docx
Existing documentaries (1).docx
MollyBrown8613 views
IGF UA - Dialog with I_ organisations - Alena Muavska RIPE NCC.pdf by RIPE NCC
IGF UA - Dialog with I_ organisations - Alena Muavska RIPE NCC.pdfIGF UA - Dialog with I_ organisations - Alena Muavska RIPE NCC.pdf
IGF UA - Dialog with I_ organisations - Alena Muavska RIPE NCC.pdf
RIPE NCC15 views
We see everywhere that many people are talking about technology.docx by ssuserc5935b
We see everywhere that many people are talking about technology.docxWe see everywhere that many people are talking about technology.docx
We see everywhere that many people are talking about technology.docx
ssuserc5935b6 views
AI Powered event-driven translation bot by Jimmy Dahlqvist
AI Powered event-driven translation botAI Powered event-driven translation bot
AI Powered event-driven translation bot
Jimmy Dahlqvist16 views

Java One 2017: Open Source Big Data in the Cloud: Hadoop, M/R, Hive, Spark and Kafka

  • 1. Open Source Big Data in OPC Edelweiss Kammermann Frank MunzJava One 2017
  • 3. © IT Convergence 2016. All rights reserved.
  • 4. © IT Convergence 2016. All rights reserved. About Me à Computer Engineer, BI and Data Integration Specialist à Over 20 years of Consulting and Project Management experience in Oracle technology. à Co-founder and Vice President of Uruguayan Oracle User Group (UYOUG) à Director of Community of LAOUC à Head of BI Team CMS at ITConvergence à Writer and frequent speaker at international conferences: à Collaborate, OTN Tour LA, UKOUG Tech & Apps, OOW, Rittman Mead BI Forum à Oracle ACE Director
  • 5. © IT Convergence 2016. All rights reserved. Uruguay
  • 6. 6 Dr. Frank Munz •Founded munz & more in 2007 •17 years Oracle Middleware, Cloud, and Distributed Computing •Consulting and High-End Training •Wrote two Oracle WLS and one Cloud book
  • 9. © IT Convergence 2016. All rights reserved. What is Big Data? à Volume: The high amount of data à Variety: The wide range of different data formats and schemas. Unstructured and semi-structured data à Velocity: The speed which data is created or consumed à Oracle added another V in this definition à Value: Data has intrinsic value—but it must be discovered.
  • 10. © IT Convergence 2016. All rights reserved. What is Oracle Big Data Cloud Compute Edition? à Big Data Platform that integrates Oracle Big Data solution with Open Source tools à Fully Elastic à Integrated with Other Paas Services as Database Cloud Service, MySQL Cloud Service, Event Hub Cloud Service à Access, Data and Network Security à REST access to all the funcitonality
  • 11. © IT Convergence 2016. All rights reserved. Big Data Cloud Service – Compute Edition (BDCS-CE)
  • 12. © IT Convergence 2016. All rights reserved. BDCS-CE Notebook: Interactive Analysis à Apache Zeppelin Notebook (version0.7) to interactively work with data
  • 13. © IT Convergence 2016. All rights reserved. What is Hadoop? à An open source software platform for distributed storage and processing à Manage huge volumes of unstructured data à Parallel processing of large data set à Highly scalable à Fault-tolerant à Two main components: à HDFS: Hadoop Distributed File System for storing information à MapReduce: programming framework that process information
  • 14. © IT Convergence 2016. All rights reserved. Hadoop Components: HFDS à Stores the data on the cluster à Namenode: block registry à DataNode: block containers themselves (Datanode) à HDFS cartoon by Mvarshney
  • 15. © IT Convergence 2016. All rights reserved. Hadoop Components: MapReduce à Retrieves data from HDFS à A MapReduce program is composed by à Map() method: performs filtering and sorting of the <key, value> inputs à Reduce() method: summarize the <key,value> pairs provided by the Mappers à Code can be written in many languages (Perl, Python, Java etc)
  • 16. © IT Convergence 2016. All rights reserved. MapReduce Example
  • 17. © IT Convergence 2016. All rights reserved. Code Example
  • 18. © IT Convergence 2016. All rights reserved. Code Example
  • 19. © IT Convergence 2016. All rights reserved. #2 Hive
  • 20. © IT Convergence 2016. All rights reserved. What is Hive? à An open source data warehouse software on top of Apache Hadoop à Analyze and query data stored in HDFS à Structure the data into tables à Tools for simple ETL à SQL- like queries (HiveQL) à Procedural language with HPL-SQL à Metadata storage in a RDBMS
  • 21. © IT Convergence 2016. All rights reserved. Hadoop & Hive Demo
  • 23. Revisited: Map Reduce I/O munz & more #23 Source: Hadoop Application Architecture Book
  • 24. Spark • Orders of magnitude(s) faster than M/R • Higher level Scala, Java or Python API • Standalone, in Hadoop, or Mesos • Principle: Run an operation on all data -> ”Spark is the new MapReduce” • See also: Apache Storm, etc • Uses RDDs, or Dataframes, or Datasets munz & more #24 https://stackoverflow.com/questions/31508083/difference-between- dataframe-in-spark-2-0-i-e-datasetrow-and-rdd-in-spark https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf
  • 25. RDDs Resilient Distributed Datasets Where do they come from? Collection of data grouped into named columns. Supports text, JSON, Apache Parquet, sequence. Read in HDFS, Local FS, S3, Hbase Parallelize existing Collection Transform other RDD -> RDDs are immutable
  • 26. Lazy Evaluation munz & more #26 Nothing is executed Execution Transformations: map(), flatMap(), reduceByKey(), groupByKey() Actions: collect(), count(), first(), takeOrdered(), saveAsTextFile(), … http://spark.apache.org/docs/2.1.1/programming-guide.html
  • 27. map(func) Return a new distributed dataset formed by passing each element of the source through a function func. flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item). reduceByKey(func, [numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V. groupByKey([numTasks]) When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. Transformations
  • 30. Spark Demo munz & more #30
  • 32. Word Count and Histogram munz & more #32 res = t.flatMap(lambda line: line.split(" ")) .map(lambda word: (word, 1)) .reduceByKey(lambda a, b: a + b) res.takeOrdered(5, key = lambda x: -x[1])
  • 34. Big Data Compute Service CE munz & more #34
  • 36. Kafka Partitioned, replicated commit log munz & more #36 0 1 2 3 4 … n Immutable log: Messages with offset Producer Consumer A Consumer B https://www.quora.com/Kafka-writes-every-message-to-broker-disk-Still-performance-wise-it- is-better-than-some-of-the-in-memory-message-storing-message-queues-Why-is-that
  • 37. Broker1 Broker2 Broker3 Topic A (1) Topic A (2) Topic A (3) Partition / Leader Repl A (1) Repl A (2) Repl A (3) Producer Replication / Follower Zoo- keeper Zoo- keeper Zoo- keeper State / HA
  • 38. https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/ - 1 topic - 1 partition - Contains every article published since 1851 - Multiple producers / consumers Example for Stream / Table Duality
  • 39. Kafka Clients SDKs Connect Streams - OOTB: Java, Scala - Confluent: Python, C, C++ Confluent: - HDFS sink, - JDBC source, - S3 sink - Elastic search sink - Plugin .jar file - JDBC: Change data capture (CDC) - Real-time data ingestion - Microservices - KSQL: SQL streaming engine for streaming ETL, anomaly detection, monitoring - .jar file runs anywhere High / low level Kafka API Configuration only Integrate external Systems Data in Motion Stream / Table duality REST - Language agnostic - Easy for mobile apps - Easy to tunnel through FW etc. Lightweight
  • 40. Oracle Event Hub Cloud Service • PaaS: Managed Kafka 0.10.2 • Two deployment modes – Basic (Broker and ZK on 1 node) – Recommended (distributed) • REST Proxy – Separate sever(s) running REST Proxy munz & more #40
  • 41. Event Hub munz & more #41
  • 43. Ports You must open ports to allow access for external clients • Kafka Broker (from OPC connect string) • Zookeeper with port 2181 munz & more #43
  • 44. Scaling munz & more #44 horizontal (up)vertical
  • 45. Event Hub REST Interface munz & more #45 https://129.151.91.31:1080/restproxy/topics/a12345orderTopic Service = Topic
  • 46. Interesting to Know • Event Hub topics are prefixed with ID domain • With Kafka CLI topics with ID Domain can be created • Topics without ID domain are not shown in OPC console 46
  • 48. TL;DR #bigData #openSource #OPC OpenSource: entry point to Oracle Big Data world / Low(er) setup times / Check for resource usage & limits in Big Data OPC / BDCS-CE: managed Hadoop, Hive, Spark + Event hub: Kafka / Attend a hands-on workshop! / Next level: Oracle Big Data tools @EdelweissK @FrankMunz
  • 51. 3 Membership Tiers • Oracle ACE Director • Oracle ACE • Oracle ACE Associate bit.ly/OracleACEProgram 500+ Technical Experts Helping Peers Globally Connect: Nominate yourself or someone you know: acenomination.oracle.com @oracleace Facebook.com/oracleaces oracle-ace_ww@oracle.com
  • 52. Sign up for Free Trial http://cloud.oracle.com