To Have Own Data
Analytics Platform,
Or NOT To
青山エンジニア勉強交流会 (Aoyama Engineer Study Meetup), April 24, 2017
Satoshi Tagomori (@tagomoris)
Satoshi "Moris" Tagomori
(@tagomoris)
Fluentd, MessagePack-Ruby, Norikra, ...
Treasure Data, Inc.
http://tsuchinoko.dmmlabs.com/?p=1770
On Feb 23, 2015
• To Have Own Data Analytics Platform, Or NOT To, In Startup Companies:
• "NOT To, in general"
• Data analytics services:
• AWS EMR, Redshift
• Google BigQuery
• Treasure Data
Options In 2017
• On Premise
• Cloudera CDH, Hortonworks HDP, ...
• Services
• AWS EMR, Redshift, Athena, Kinesis Analytics, ...
• Google BigQuery, Cloud Dataflow, Cloud Dataproc, ...
• MS Azure SQL Data Warehouse, Stream Analytics, Data Lake Analytics, ...
• Treasure Data
TO HAVE
OR
NOT TO HAVE
?
DO NOT
😝
Anyway,
NO FINE CONCLUSION
IN THIS PRESENTATION
On Premise Platform In Past
• 2011-2014: On-premise Hadoop & Presto cluster
• w/ Fluentd stream processing cluster
• w/ Norikra stream processing
• w/ Web UI (Shib)
https://www.slideshare.net/tagomoris/lambda-architecture-using-sql-hadoopcon-2014-taiwan
To Be Considered
• Distributed Processing Platform
• Data Management
• Process Management
• Platform Management
• Visualization and BI
• Connecting Data
Distributed Processing Platform
• Hadoop, Presto, Spark, Flink, Storm, ...
• + Servers
• EMR, Redshift, Dataproc, ...
• Cost per instance
• BigQuery, Athena, Treasure Data, ...
• Cost per data/queries/...
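The two pricing models above can be compared with a little arithmetic. The following is a minimal sketch, not a pricing guide: every rate and function name here is a hypothetical value chosen for illustration, not taken from any vendor's price list.

```python
# All prices are hypothetical placeholders, not real vendor rates.
HOURS_PER_MONTH = 730


def instance_based_cost(nodes: int, hourly_rate: float,
                        hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost when paying for running instances, busy or idle."""
    return nodes * hourly_rate * hours


def usage_based_cost(tb_scanned: float, price_per_tb: float) -> float:
    """Monthly cost when paying for the data scanned by queries."""
    return tb_scanned * price_per_tb


def break_even_tb(nodes: int, hourly_rate: float, price_per_tb: float,
                  hours: float = HOURS_PER_MONTH) -> float:
    """Scanned volume (TB/month) at which both models cost the same."""
    return instance_based_cost(nodes, hourly_rate, hours) / price_per_tb


if __name__ == "__main__":
    # Assumed example: 10 nodes at 0.5 per hour vs. 5.0 per TB scanned.
    print(instance_based_cost(10, 0.5))
    print(usage_based_cost(100, 5.0))
    print(break_even_tb(10, 0.5, 5.0))
```

Below the break-even volume, the usage-based model is cheaper under these assumptions; above it, the instance-based model is. Real bills also include storage, data transfer, and operations time, which this sketch ignores.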
Data Management
• How to collect data?
• How to ingest data?
• How to manage schema?
• How to move data from here to there?
Process Management
• How to run queries on schedule?
• How to build workflow between queries?
• How to run queries after data ingestion?
• How to move data from the platform to elsewhere after queries?
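As one way to picture "workflow between queries", the sketch below orders queries by their dependencies using Python's standard `graphlib` module (Python 3.9+). It is only an illustration: the query names and the `run_query` callback are hypothetical, and a real scheduler would also need retries, scheduling, and failure handling.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set


def run_workflow(dependencies: Dict[str, Set[str]],
                 run_query: Callable[[str], None]) -> List[str]:
    """Run each query only after every query it depends on has run.

    `dependencies` maps a query name to the set of query names
    that must finish before it.
    """
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_query(name)
    return order


if __name__ == "__main__":
    # Hypothetical workflow: ingest, then summarize, then export.
    deps = {
        "daily_summary": {"ingest_logs"},
        "export_results": {"daily_summary"},
    }
    run_workflow(deps, lambda name: print("running", name))
```

Here "run queries after data ingestion" and "move data elsewhere after queries" are simply the first and last edges of the dependency graph.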
Platform Management
• How to upgrade software?
• How to add nodes?
• How to manage failures / downtime?
• How to replace hardware?
• How to switch platforms?
• How to provide compatibility for queries?
Visualization and BI
• How to show query results graphically?
• How to show relations between data graphically?
• How to query data interactively?
Connecting Data
• How to join logs and master data?
• How to join logs and user list?
• How to join logs and CRM data?
• How to push query results to marketing tools/services?
• How to send notifications using query results?
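To make "join logs and user list" concrete, here is a minimal in-memory sketch. The field names (`user_id`, `plan`, `path`) are assumptions for illustration; on a real platform this would usually be a SQL join between a log table and a table imported from another system.

```python
from typing import Dict, List


def join_logs_with_users(logs: List[Dict], users: List[Dict]) -> List[Dict]:
    """Inner join of log records with user records on `user_id`."""
    users_by_id = {user["user_id"]: user for user in users}
    joined = []
    for log in logs:
        user = users_by_id.get(log["user_id"])
        if user is None:
            continue  # skip logs from users missing in the user list
        # Copy the log and attach the user's attributes to it.
        extra = {k: v for k, v in user.items() if k != "user_id"}
        joined.append({**log, **extra})
    return joined


if __name__ == "__main__":
    # Hypothetical sample records.
    logs = [
        {"user_id": 1, "path": "/top"},
        {"user_id": 2, "path": "/cart"},
        {"user_id": 9, "path": "/top"},
    ]
    users = [
        {"user_id": 1, "plan": "free"},
        {"user_id": 2, "plan": "paid"},
    ]
    for row in join_logs_with_users(logs, users):
        print(row)
```

The hard part in practice is less the join itself than getting the master, user, or CRM data onto the same platform as the logs and keeping it fresh.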
Additional Topics
• Stream Processing Platform
• Machine Learning Platform
• AI(?) Services
In My Past Case:
• Distributed Processing Platform
• Hadoop & Presto (& Norikra)
• Data Management
• Hive schema & custom-made UI (Shib)
• Managed by the engineers of each service
• Process Management
• Custom-made query scheduler (ShibUI)
• Platform Management
• By tagomoris
• Visualization, BI: N/A
• Connecting Data: N/A
About Treasure Data
• Distributed Processing Platform: Hive, Presto
• Data Management: Fluentd & Schema-less DB
• Process Management: Digdag / Treasure Workflow
• Platform Management: Automatic
• Visualization and BI: Treasure BI
• Connecting Data: Embulk / Data Connector
😝
Recent Improvements around Data Analytics
• Improvements of CDH/HDP to manage clusters
• Online Upgrade
• Support many processing frameworks
• Many new data processing software/frameworks
• Apache Flink, Apache Arrow, Apache Beam, ...
• Many new services available
• Stream processing, Machine learning, ...
MONEY
• Saving money is important - it's true.
MONEY
• Saving money introduces many issues - it's true!
MONEY
• Money solves many problems - is it true?
Complexity
• Connecting data / processing with applications
• Connecting data / processing with services
• Connecting data / processing with people
Chasing the World
• New software / services / platforms / paradigms appear day by day
• Data sizes are growing day by day
• Complexity is growing day by day
• A data platform CANNOT live as-is for 5 years!
Finding Treasure From Data
• "Data Processing" is:
• NOT the purpose
• just a tool to get something great
• Use developers and their time to find treasures!
TBD
Thank you!
@tagomoris
