EMERGING TRENDS IN DATA
ANALYTICS
Wei-Chiu Chuang, Ph.D.
HDFS Lead Engineer, Cloudera | Apache Hadoop PMC/Committer
© 2019 Cloudera, Inc. All rights reserved. 2
Hadoop HDFS
Lead Engineer
Committer/PMC
Founding member
“BIG DATA” IS PASSÉ
https://trends.google.com/trends/explore?date=today%205-y&q=Big%20Data,Data%20Analytics
LANDSCAPE OF COMMERCIAL OPEN
SOURCE DATA ANALYTICS SOFTWARE
A YEAR OF TECTONIC SHIFT
Merged
Acquired by
Acquired by
OPEN SOURCE DATA ANALYTICS SOFTWARE UNICORNS
SUB-BILLION, UNICORNS-TO-BE
[Bar chart: valuation in units of $100 M for Lucidworks, neo4j, and H2O.AI; bar values 3.7, 5, and 4]
DATA ENGINEER COMMUNITY
IS HADOOP DEAD?
MOST ACTIVE
VISITS AND
DOWNLOADS
• Hadoop web pages
are the most popular
among all Apache
projects.
Apache Software Foundation Annual Report 2019
MAPREDUCE IS
DEAD;
LONG LIVE HDFS
AND YARN
• Stack Overflow
Trends
• HDFS and YARN are
mature
• Hadoop is the on-prem platform for Big Data Analytics.
• Like Linux. Boring, but it’s the foundation.
COMPUTE ENGINES IN THE HADOOP ECOSYSTEM
• Stack Overflow
Trends
• Spark most popular
• Hive stable
• MapReduce, Pig and
Storm are dead.
https://insights.stackoverflow.com/trends?tags=apache-spark%2Chadoop%2Chive%2Cmapreduce%2Capache-pig%2Capache-storm
BIG DATA: YOUR
NAME IS SQL
• Stack Overflow Trends
• Hive was the most
popular until 2018.
• SparkSQL grew fastest
until 2018.
• Cloud: BigQuery is more
popular than Redshift
BATCH VS REAL-TIME
KAFKA
• Message broker >>
stream processing
STREAM
PROCESSING
• Stack Overflow Trends
⎯ (exclude Kafka
Streams)
• Flink grows fastest;
Beam too
• Spark Streaming
declining
• Storm is dead
https://insights.stackoverflow.com/trends?tags=spark-streaming%2Capache-flink%2Capache-storm%2Capache-beam
SPARK
• Stack Overflow Trends
• Spark no longer the cool kid
• You will write in PySpark or
SparkSQL.
• Spark Streaming is declining
• Very few people develop ML
with Spark.
⎯ Wait for Spark 3.0?
• Batch >> streaming
https://insights.stackoverflow.com/trends?tags=apache-spark%2Cpyspark%2Cspark-streaming%2Cspark-dataframe%2Cpyspark-sql%2Capache-spark-sql%2Capache-spark-mllib
LANGUAGE
• Python > Java > C++
> Go
https://insights.stackoverflow.com/trends?tags=python%2Cgo%2Cjava%2Cc%2B%2B
TRENDY
TECHNOLOGIES
• Stack Overflow Trends
https://insights.stackoverflow.com/trends?tags=tensorflow%2Ckubernetes%2Capache-spark%2Capache-kafka%2Cdocker
SUMMARY
• Deep learning (TensorFlow)
• Microservices (Kubernetes, Docker, Kafka)
• Batch > Streaming, but Streaming is gaining traction
• Python
DEVELOPER COMMUNITY
APACHE SOFTWARE FOUNDATION
20th anniversary
300+ projects
48 incubating projects
FY2019:
187 k commits
3215 committers
CNCF
• 6 graduated projects
• 16 incubating projects
• 18 sandbox projects
• Last year
⎯ 141 k commits
⎯ 7647 committers
APACHE VS CNCF
Apache
• Big Data, Database, Cloud
• Contributors are individuals
CNCF
• DevOps tools
• Contributors are affiliated with companies
IMPACT OF CLDR-HWX MERGER
• 63% of Hadoop commits in 2019 were made by Cloudera employees.
• Community development
• Bad news: Apache Ambari, Apache Sentry
• Good news: Hive support for Kudu, Ranger support for Impala
DEVELOPER COMMUNITY MOVING TO ASIA
• HBase
⎯ HBaseCon Asia
• Hadoop
⎯ 1st ever Hadoop Meetup in China
• China is the third largest contributor to
CNCF projects.
⎯ 3 projects were born in China
Apache github visits
MACHINE LEARNING
● Apache Hadoop Submarine
● Submarine Project.
● Distributed machine learning platform
● algorithm development, model batch training, model incremental training, model online services and
model management
⎯ Available since: 3.2.0 (As part of YARN)
⎯ Became a top-level subproject: 0.2.0 (separate release)
⎯ Lots of new stuff in 0.3.0.
https://hadoop.apache.org/submarine/
MEASURING THE HEALTH OF
DEVELOPER COMMUNITY
CREATED/RESOLVED JIRAS
NUMBER OF CONTRIBUTORS
Apache Spark: https://www.openhub.net/p/apache-spark
MongoDB: https://www.openhub.net/p/mongodb
DIVERSITY (AFFILIATION) OF KUBERNETES DEVELOPERS
https://k8s.devstats.cncf.io/d/8/company-statistics-by-repository-group?orgId=1&var-period=m&var-metric=prs&var-repogroup_name=All&var-companies=All
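A rough sketch of how an affiliation breakdown could be approximated, assuming commit author emails are available and that the email domain is a usable proxy for the employer. Both are assumptions made for illustration, and this is not necessarily how the dashboard linked above derives its numbers. The function name `affiliation_share` and the sample addresses are hypothetical.

```python
from collections import Counter


def affiliation_share(author_emails):
    """Return {email domain: fraction of commits}, largest share first.

    Assumption: one email per commit, and the domain stands in for the
    contributor's company. Personal addresses will skew the result.
    """
    domains = Counter(
        email.rsplit("@", 1)[-1].lower() for email in author_emails
    )
    total = sum(domains.values())
    return {domain: count / total for domain, count in domains.most_common()}


# Hypothetical sample data, not real project statistics.
sample = [
    "alice@example.com",
    "bob@example.com",
    "carol@example.org",
    "dave@example.net",
]

shares = affiliation_share(sample)
for domain, share in shares.items():
    print(f"{domain}: {share:.0%}")
```

A project where one domain holds most of the share depends heavily on a single company, which is the kind of concentration this slide is about.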
APACHE APEX (2016 - 2018)
The story of an abandoned project
• “Enterprise-grade unified stream and batch
processing engine.”
• Founded in April, 2016.
• Backed by DataTorrent, which collapsed in
May 2018 after raising $23.9 million.
APACHE APEX
• https://reporter.apache.org/wizard/statistics?apex
• Last new PMC member was added on 2018-05-15.
• Last new committer was added on 2017-10-19.
• Community Health Score (Chi):
-4.28 (Action required!)
WILL CLOUD KILL OPEN SOURCE?
OPEN SOURCE VS.
PROPRIETARY
• Open source data systems are about to
overtake proprietary systems in
popularity for the first time.
• Why open source?
⎯ Free
⎯ Innovation
⎯ Industry standard
Source: https://db-engines.com/en/ranking_osvsc
AMAZON’ED: CLOUD PROVIDERS THREATENING OSS VENDORS
Redis Labs: AGPL → Redis Source Available License
MongoDB: AGPL → Server Side Public License
Confluent: Apache 2.0 → Confluent Community License
Cockroach Labs: Apache 2.0 → Business Source License
ALL IS NOT LOST
Cloudera will be 100% open source
• Hadoop, Spark, Kafka, ...
Apache Software License 2.0
• Cloudera Manager
• Cloudera Data Science Workbench
• Cloud service
Proprietary → AGPL
CLOUD VENDORS’ OSS STRATEGY
Amazon AWS
EMR, Open Distro for Elasticsearch,
DocumentDB, Amazon MSK
Microsoft Azure
Partnership:
HDInsight, Azure Databricks, Azure Red Hat OpenShift
Google GCP
Partnership:
Confluent, DataStax, Elastic, InfluxData,
MongoDB, Neo4j, Redis Labs
CLOUD NATIVE, CLOUD FIRST
KUBERNETES IS THE NEW OPERATING SYSTEM
KubeCon attendance
YuniKorn
FUTURE: HYBRID CLOUD
HYBRID CLOUD IS THE NEW NORM
• Public cloud deployments will capture most of growth.
• On-prem deployments will still exist, for niche use cases.
⎯ Regulation (FinServ, Healthcare)
⎯ High density (>100 TB per node)
⎯ Specialized hardware (100Gbps NIC, GPU, FPGA, NVMe, Vector Engine)
TAKEAWAY
TAKEAWAY
Big Data → Data Analytics
Commercial open source software market is booming
Don’t bet on a single open source project
Cloud vendors will find a balance with OSS vendors
Hybrid cloud
WE ARE HIRING!
https://www.cloudera.com/careers/teams/engineering.html
(remote positions available)
THANK YOU
