
Open Source Tools for Big Data


Published on

Exove Extends September 19th, 2017

Open Source Tools for Big Data, Teemu Heikkilä, Emblica

Short introduction to open source tools around big data analytics and development with such tools

Published in: Technology


  1. 1. OPEN SOURCE TOOLS FOR BIG DATA Helsinki 19.9.2017 Teemu Heikkilä Emblica
  2. 2. EMBLICA We’re a very small company of 5 people. We’re into Data Engineering, DevOps and ML. We’re hiring!
  3. 3. Let’s start with something simple first
  4. 4. What is “big data”?
  5. 5. Just a buzzword…
  6. 6. …but it still has a meaning.
  7. 7. 1TB of data, is it BIG?
  8. 8. Volume Velocity Variety
  9. 9. We are not really at Facebook scale, so is it worth talking about big data tools?
  10. 10. Answer: Yes!
  11. 11. Question: Why?
  12. 12. Because what works with petabytes of data almost certainly works with gigabytes
  13. 13. Helsinki City bike station usage 17M rows of JSON
  14. 14. You will get: fault tolerance, reliability, scalability and working models for processing any amount of data
  15. 15. … but it doesn't necessarily mean you need fancy frameworks
  16. 16. History of data processing with free software
  17. 17. HISTORY OF HADOOP
 1997, START: Doug Cutting started to develop the first version of Lucene.
 2001, OPEN SOURCED: Cutting open sourced Lucene and it was moved under the Apache Foundation. Mike Cafarella joined Cutting to start Apache Nutch, a project to index the whole internet.
 2003, GFS: Google published a whitepaper about solving the storage problems of web indexing. Cafarella and Cutting implemented the whitepaper as part of the Nutch project.
 2006, HADOOP: Cutting, by then at Yahoo!, moved the NDFS and MapReduce related codebase under a new project called Hadoop.
 NOW
  18. 18. Ideas (whitepapers) → FOSS implementations: DFS → HDFS, MR → Hadoop MR, BigTable → HBase, Dynamo → Cassandra
  19. 19. (Notable) formats of Big data
  20. 20. ACTIVITY DATA: clickstreams, app usage, application-specific usage, music listening, video streaming, money usage, credit cards, transactions
  21. 21. SENSOR DATA: locations, spatial data, sensor metrics, IoT devices (industrial and consumer), time series
  22. 22. UNSTRUCTURED DATA: machine logs, unstructured text, natural language, sound, photos, video
  23. 23. Use cases What are you using those fancy logos for?
  24. 24. CASE 1: EVENT SOURCING SQL DATABASES Working legacy systems that used a MySQL database as realtime data storage. No historical data was ever saved. Delete means delete. Update means update. We could touch the legacy code to save the changes, but we don’t have to.
  25. 25. Maxwell’s daemon: reads the MySQL replication binary log and produces a stream of JSON-formatted changes.
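A minimal sketch of consuming one such change line in Python. The field names (`database`, `table`, `type`, `ts`, `data`, `old`) are assumed to follow the JSON layout that Maxwell's daemon documents; the sample event and the helper `describe_change` are made up for illustration.

```python
import json

# Illustrative change event. The values are invented; the field names are
# assumed to match Maxwell's JSON output.
raw_event = json.dumps({
    "database": "shop",
    "table": "user",
    "type": "update",
    "ts": 1505800000,
    "data": {"id": 7, "email": "new@example.com"},
    "old": {"email": "old@example.com"},
})


def describe_change(line):
    """Parse one JSON change line and summarise what it touched."""
    event = json.loads(line)
    target = f'{event["database"]}.{event["table"]}'
    # "old" is assumed to hold the previous values of the changed columns;
    # it is absent for inserts, hence the default.
    changed = sorted(event.get("old", {}))
    return {"target": target, "type": event["type"], "changed_columns": changed}


summary = describe_change(raw_event)
print(summary)
```

Because every update arrives as its own event, the history that the legacy database overwrote can be kept downstream without touching the legacy code.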
  26. 26. ?
  27. 27. KAFKA - DISTRIBUTED APPEND-ONLY LOG Kafka was originally developed at LinkedIn and open sourced in 2011. Distributed, append-only log. Great tool for reliably delivering millions of arbitrarily formatted messages. Scales by partitioning and adding new nodes. (c) Ch.ko123 / CC BY 4.0
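The idea of a partitioned, append-only log can be sketched in plain Python. This is a toy in-memory model, not the Kafka client API: the class `PartitionedLog` is hypothetical, and `zlib.crc32` merely stands in for whatever hash a real partitioner uses.

```python
import zlib


class PartitionedLog:
    """Toy append-only log split into a fixed number of partitions."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # The same key always hashes to the same partition, so per-key
        # ordering is preserved inside that partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Append a message and return its (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def read(self, partition, offset):
        """Read every message in a partition from the given offset onwards."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=3)
p1, o1 = log.append("user-1", "login")
p2, o2 = log.append("user-1", "logout")
print(p1 == p2, o1, o2)
```

Messages are never modified after being appended; readers only track an offset per partition, which is what lets partitions be spread over more nodes as the load grows.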
  28. 28. + Fast writes (queue/log), + Fast reads (in-memory), - Latency, - Reliable event delivery (c) Apache Spark
  29. 29. MATERIALIZING EVENT SOURCES Change stream Change stream Change stream Materialized ‘User’-table Materialized ‘Resource’-table Materialized ‘Usage’-table
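Materializing a table from a change stream amounts to folding the events, in order, into the latest row per key. A minimal sketch, assuming each event carries a `type` (insert, update or delete) and the full row under `data`, with a hypothetical `id` column as the primary key; the sample rows are made up.

```python
def apply_change(table, event):
    """Fold one change event into a dict keyed by primary key."""
    row = event["data"]
    key = row["id"]
    if event["type"] == "delete":
        table.pop(key, None)
    else:
        # Inserts and updates both leave the latest full row in place.
        table[key] = row
    return table


changes = [
    {"type": "insert", "data": {"id": 1, "name": "Alice"}},
    {"type": "insert", "data": {"id": 2, "name": "Bob"}},
    {"type": "update", "data": {"id": 1, "name": "Alicia"}},
    {"type": "delete", "data": {"id": 2, "name": "Bob"}},
]

users = {}
for change in changes:
    apply_change(users, change)

print(users)
```

Replaying the same stream from the beginning always rebuilds the same table, and stopping the replay early gives the table as it looked at that point in time.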
  30. 30. APACHE SPARK Originally developed at the University of California, Berkeley's AMPLab. General large-scale data processing framework. Based on the MapReduce architecture, but keeps intermediate results in memory instead of saving them to slow disks like Hadoop does. Supports lots of different data sources.
 Programming APIs for Scala, Java or Python. (c) Ch.ko123 / CC BY 4.0
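The map, shuffle and reduce stages can be illustrated in plain Python. This is not the Spark API, only a single-process stand-in in which the intermediate results of each stage stay in memory; the station records are invented sample data.

```python
from collections import defaultdict

# Made-up sample records, one per bike pick-up.
events = [
    {"station": "A"},
    {"station": "B"},
    {"station": "A"},
]


def map_phase(records):
    """Emit a (key, 1) pair for every record."""
    return [(record["station"], 1) for record in records]


def shuffle(pairs):
    """Group the emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the grouped values per key."""
    return {key: sum(values) for key, values in groups.items()}


# Each stage hands its result to the next one as an in-memory object;
# nothing is written to disk between the stages.
counts = reduce_phase(shuffle(map_phase(events)))
print(counts)
```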
  31. 31. EKS-STACK Elasticsearch is based on Lucene, but it is more than just a search engine: it can provide real-time analytics even for end users, and it is usually used to store the aggregated data. Kibana is a great tool for developers and for internal use to discover and analyze the data inside ES. Spark is used to process the events, produce the needed aggregates and ingest the data into Elasticsearch so it can be queried.
  32. 32. Screenshot by
  33. 33. CASE 2: EVERY ANALYTICS PIPELINE EVER User agent → Events → Event Collector → Processing → Analytics
  34. 34. Event source (demo)
  35. 35. What are we sampling?
  36. 36. State, New State, Event, Session
  37. 37. New session: started 07:17:09, duration 0s, OPEN
 Existing session: started 07:17:09, duration 5s, OPEN
 Existing session: started 07:17:09, duration 10s, OPEN
 Existing session: started 07:17:09, duration 14s, paused 07:17:23, CLOSED
  38. 38. New session: started 07:17:09, duration 0s, OPEN
 Existing session: started 07:17:09, duration 5s, OPEN
 Existing session: started 07:17:09, duration 10s, OPEN
 Existing session: started 07:17:09, duration 14s, paused 07:17:23, CLOSED
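The session output above can be read as repeated applications of "state + event → new state". A minimal sketch of such an update function follows; the event kinds `play` and `pause`, the field names, and the intermediate timestamps are assumptions chosen to match the durations shown, not taken from the demo code.

```python
from datetime import datetime


def update_session(state, event):
    """Combine the previous session state (or None) with one event."""
    ts = datetime.strptime(event["time"], "%H:%M:%S")
    if state is None:
        # First event for this session: open a new one.
        return {"started": ts, "duration": 0, "status": "OPEN"}
    new_state = dict(state)
    new_state["duration"] = int((ts - state["started"]).total_seconds())
    if event["kind"] == "pause":
        new_state["paused"] = ts
        new_state["status"] = "CLOSED"
    return new_state


# Hypothetical event stream: samples roughly every five seconds, then a pause.
events = [
    {"time": "07:17:09", "kind": "play"},
    {"time": "07:17:14", "kind": "play"},
    {"time": "07:17:19", "kind": "play"},
    {"time": "07:17:23", "kind": "pause"},
]

state = None
history = []
for event in events:
    state = update_session(state, event)
    history.append((state["duration"], state["status"]))

print(history)
```

Keeping the update a pure function of the old state and the event is what allows a stream processor to store only the current state per session instead of the full event history.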
  39. 39. You can find me at: @theikkilap Any questions? Thanks! Icons from Font Awesome project