Data-Driven Development Era and Its Technologies

Developers Summit 2015 Autumn, Data Tech

  1. 1. Data-Driven Development Era and Its Technologies Developers Summit 2015 Autumn (Oct 14, 2015) Satoshi Tagomori (@tagomoris)
  2. 2. Satoshi "Moris" Tagomori (@tagomoris) Fluentd, Norikra, Hadoop, ... Treasure Data, Inc.
  3. 3. HQ Branch
  4. 4. http://www.treasuredata.com/
  5. 5. Main Topics around "Data" • Data collection • Storage • Data processing • Batch distributed processing • Stream processing • Machine Learning • Near real-time query & Data lake • Visualization
  6. 6. Data Analytics Flow Collect Store Process Visualize Data source Reporting Monitoring
  7. 7. Where before What
  8. 8. Using Services or Not • Using services fully-managed: • Google BigQuery & Dataflow • Treasure Data services • Using services self-managed: • Amazon EMR & Redshift • Google Cloud Dataproc • Using your own environment & cluster
  9. 9. Using Services or Not • Using services fully-managed (a bit more cost, far less effort): • Google BigQuery & Dataflow • Treasure Data services • Using services self-managed (less cost, less effort): • Amazon EMR & Redshift • Google Cloud Dataproc • Using your own environment & cluster (fully controlled by yourself, far more effort)
  10. 10. Using Services or Not: "Use Services!" To concentrate on DATA and analytics, NOT tools
  11. 11. Why should we use services? • About distributed systems: • hard to operate & upgrade • impossible to "small-start" • very hard to hire professional engineers • Data-Driven Development: • collect/store data first! • consider output data second! • "before building your own environment"
  12. 12. Really? Are you a TD guy? • ...Really! • But it requires very long discussions :P • "Data processing platforms for startups: build one or use one?" (article in Japanese)
 http://tsuchinoko.dmmlabs.com/?p=1770
  13. 13. How to choose software/services in Data-Driven Development
  14. 14. "What" decides "How" • Distributed systems exist to solve problems • There are many kinds of data • There are many problems • Different systems solve different problems • There is no "silver bullet"!
  15. 15. What First, How Second • What do you want to do? • Reporting? Analytics? Recommendation? or ... • What type of data do you want to process? • Stored large logs? Streaming sensor data? or ... • What do you need as a result? • CSV? Spreadsheet? Graph? DB relation? or ...
  16. 16. How? (just for example) • MapReduce, Tez • Large batch jobs, big JOINs, high stability • Spark • Small/medium batch jobs, machine learning • Impala, Presto, Drill, Redshift, BigQuery • Near-real-time search, small-to-large analytics • Storm, Spark Streaming • Stream data conversion/aggregation
  17. 17. "Processing" is just one part of the whole dataflow!
  18. 18. Data Analytics Flow (again) Collect Store Process Visualize Data source Reporting Monitoring
  20. 20. Data Collection • Data-Driven Development -> collect first! • As batch: Data already exists as files • Easily integrated with existing batch systems • Sqoop, Embulk, ... • As stream: Data is being generated right now • Easily connected with monitoring systems • Without bursts of network traffic • Flume, Logstash, Fluentd, ...
  21. 21. Fluentd: Support Service by SRA OSS with Treasure Data Released TODAY!
  22. 22. Other Important Topics • Storage: Performance, Availability, Schema management • Apache Hadoop HDFS, Apache HBase, Amazon S3, Cloudera Kudu, ... • Visualization: Functionality, Connectivity, Visibility • Tableau, Pentaho, Many other enterprise products, ... • Distributed Queues: Performance, Stability, Connectivity • Apache Kafka, Amazon Kinesis, ...
  23. 23. Get Familiar with the Options So You Do NOT Have to Take Pains over Technology!
  24. 24. Concentrate on DATA and analytics, NOT tools. Thanks!
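
The Collect, Store, Process, Visualize flow from the slides can be sketched in a few lines of plain Python. This is only an illustrative sketch and not something taken from the talk: the function names (`collect_batch`, `collect_stream`, `store`, `process`, `visualize`), the in-memory list standing in for storage, and the sample log records are all hypothetical. A real pipeline would more likely use tools such as those named in the slides (for example Embulk or Fluentd for collection, HDFS or S3 for storage).

```python
from collections import Counter
from typing import Iterable, Iterator

# Hypothetical sample records; a real source would be log files or sensors.
SAMPLE_LOGS = [
    {"path": "/index", "status": 200},
    {"path": "/login", "status": 500},
    {"path": "/index", "status": 200},
    {"path": "/login", "status": 200},
]

def collect_batch(records: list[dict]) -> list[dict]:
    """Batch collection: the data already exists, so take it all at once."""
    return list(records)

def collect_stream(records: Iterable[dict]) -> Iterator[dict]:
    """Stream collection: hand over each record as soon as it is generated."""
    for record in records:
        yield record

def store(records: Iterable[dict], storage: list[dict]) -> None:
    """Store: append to an in-memory list that stands in for HDFS, S3, etc."""
    storage.extend(records)

def process(storage: list[dict]) -> Counter:
    """Process: count requests per path (a stand-in for a batch query)."""
    return Counter(record["path"] for record in storage)

def visualize(counts: Counter) -> str:
    """Visualize: render a tiny text bar chart, sorted by path name."""
    return "\n".join(f"{path:<8}{'#' * n}" for path, n in sorted(counts.items()))

storage: list[dict] = []
store(collect_batch(SAMPLE_LOGS[:2]), storage)   # data that already existed
store(collect_stream(SAMPLE_LOGS[2:]), storage)  # data arriving afterwards
counts = process(storage)
report = visualize(counts)
print(report)
```

Each stage only depends on the shape of the records it receives, so under these assumptions one stage can be swapped out (say, a managed service in place of the in-memory list) without touching the others.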
