
To Have Own Data Analytics Platform, Or NOT To

Aoyama Engineer Meetup at F@N Communication

Published in: Software

  1. 1. To Have Own Data Analytics Platform, Or NOT To Aoyama Engineer Meetup (青山エンジニア勉強交流会) April 24, 2017 Satoshi Tagomori (@tagomoris)
  2. 2. Satoshi "Moris" Tagomori (@tagomoris) Fluentd, MessagePack-Ruby, Norikra, ... Treasure Data, Inc.
  3. 3. http://tsuchinoko.dmmlabs.com/?p=1770
  4. 4. On Feb 23, 2015 • To Have Own Data Analytics Platform, Or NOT To, In Startup Companies: • "NOT To, in general" • Data analytics services: • AWS EMR, Redshift • Google BigQuery • Treasure Data
  5. 5. Options In 2017 • On Premise • Cloudera CDH, Hortonworks HDP, ... • Services • AWS EMR, Redshift, Athena, Kinesis Analytics, ... • Google BigQuery, Cloud Dataflow, Cloud Dataproc, ... • MS Azure SQL Data Warehouse, Stream Analytics, Data Lake Analytics, ... • Treasure Data
  6. 6. TO HAVE OR NOT TO HAVE ?
  7. 7. DO NOT
  8. 8. 😝
  9. 9. Anyway,
  10. 10. NO FINE CONCLUSION IN THIS PRESENTATION
  11. 11. On-Premise Platform In The Past • 2011-2014: On-premise Hadoop & Presto cluster • w/ Fluentd stream processing cluster • w/ Norikra stream processing • w/ Web UI (Shib) https://www.slideshare.net/tagomoris/lambda-architecture-using-sql-hadoopcon-2014-taiwan
  12. 12. To Be Considered • Distributed Processing Platform • Data Management • Process Management • Platform Management • Visualization and BI • Connecting Data
  13. 13. Distributed Processing Platform • Hadoop, Presto, Spark, Flink, Storm, ... • + Servers • EMR, Redshift, Dataproc, ... • Cost per instance • BigQuery, Athena, Treasure Data, ... • Cost per data/queries/...
  14. 14. Data Management • How to collect data? • How to ingest data? • How to manage schema? • How to move data from here to there?
  15. 15. Process Management • How to run queries on schedule? • How to build workflow between queries? • How to run queries after data ingestion? • How to move data from the platform to elsewhere after queries?
  16. 16. Platform Management • How to upgrade software? • How to add nodes? • How to manage failures / downtime? • How to replace hardware? • How to switch platforms? • How to provide compatibility for queries?
  17. 17. Visualization and BI • How to show query results graphically? • How to show relations between data graphically? • How to query data interactively?
  18. 18. Connecting Data • How to join logs and master data? • How to join logs and user list? • How to join logs and CRM data? • How to push query results to marketing tools/services? • How to send notifications using query results?
  19. 19. Additional Topics • Stream Processing Platform • Machine Learning Platform • AI(?) Services
  20. 20. In My Past Case: • Distributed Processing Platform • Hadoop & Presto (& Norikra) • Data Management • Hive schema & Custom-made UI (Shib) • Managed by engineers of each service • Process Management • Custom-made query scheduler (ShibUI) • Platform Management • By tagomoris • Visualization, BI: N/A • Connecting Data: N/A
  21. 21. About Treasure Data • Distributed Processing Platform: Hive, Presto • Data Management: Fluentd & Schema-less DB • Process Management: Digdag / Treasure Workflow • Platform Management: Automatic • Visualization and BI: Treasure BI • Connecting Data: Embulk / Data Connector 😝
  22. 22. Recent Improvements around Data Analytics • Improvements of CDH/HDP to manage clusters • Online Upgrade • Support many processing frameworks • Many new data processing software/frameworks • Apache Flink, Apache Arrow, Apache Beam, ... • Many new services available • Stream processing, Machine learning, ...
  23. 23. MONEY • Saving money is important - it's true.
  24. 24. MONEY • Saving money introduces many issues - it's true!
  25. 25. MONEY • Money solves many problems - is it true?
  26. 26. Complexity • Connecting data / processing with applications • Connecting data / processing with services • Connecting data / processing with people
  27. 27. Chasing the World • Many new software / services / platforms / paradigms, day by day • Data sizes are growing day by day • Complexity is growing day by day • A data platform CANNOT live as-is for 5 years!
  28. 28. Finding Treasure From Data • "Data Processing" is: • NOT the purpose • just a tool to get something great • Use developers and their time to find treasures!
  29. 29. TBD Thank you! @tagomoris
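
The Process Management questions on slide 15 (how to build workflow between queries, how to run queries after data ingestion) come down to ordering tasks by their dependencies. A minimal sketch of that idea in Python follows, using the standard library `graphlib` module (Python 3.9+). The task names and the `run` helper are hypothetical and do not come from the slides; a real workflow engine such as the ones named on slide 21 would add scheduling, retries, and failure handling on top.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each key is a task, each value is the set of tasks
# that must finish before it can start. All task names are illustrative.
workflow = {
    "ingest_logs": set(),
    "daily_sessions": {"ingest_logs"},
    "daily_revenue": {"ingest_logs"},
    "kpi_report": {"daily_sessions", "daily_revenue"},
    "export_to_marketing_tool": {"kpi_report"},
}


def execution_order(graph):
    """Return the tasks in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


def run(graph, runner):
    """Call `runner(task)` for each task after all of its dependencies."""
    results = {}
    for task in execution_order(graph):
        results[task] = runner(task)
    return results


if __name__ == "__main__":
    # Print one valid execution order for the hypothetical workflow.
    for task in execution_order(workflow):
        print(task)
```

Ingestion always comes first and the export always comes last here. The two daily queries have no dependency on each other, so a scheduler is free to run them in either order or in parallel.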
