Here's the presentation I gave at the KW Big Data Peer2Peer meetup held at Communitech on 3rd November 2015.
The deck served as a backdrop to the interactive session:
http://www.meetup.com/KW-Big-Data-Peer2Peer/events/226065176/
The scope was to drive an architectural conversation about:
■ What does it actually take to get the data you need to add that one metric to your report/dashboard?
■ What is it like to navigate the early conversations of an analytics solution?
■ How is one technology selected over another, and how do those selections impact or define other selections?
2. Who Am I – George Long
■ Software Architect living in KW area
■ 3 decades of software engineering in UK and North America
– Speciality is distributed systems design with Big Data, Cloud and NoSQL technologies
■ 5th degree black belt in taekwondo
■ Email: master.geo.san@gmail.com
■ LinkedIn: https://ca.linkedin.com/in/mastergeosan
3. Overview
■ What does it actually take to get the data you need to add that one metric to your report/dashboard?
■ What is it like to navigate the early conversations of an analytics solution?
■ How is one technology selected over another, and how do those selections impact or define other selections?
4. Agenda
■ Setting the solution space
■ Sample big data use cases
■ Towards a Big Data Culture
■ Hadoop Tools
– ETL tools for ingest
– Tools for data manipulation
– Publishing Results
6. Ingest Use Case Prerequisites
■ What is your use case – what are you trying to do?
– Are there multiple asks?
– Who are the end-users?
■ Is this a new ask or refinements to existing workflows?
– How responsive is your organisation to change?
– Is there an existing team to manage the solution?
■ How will the data sources yield their information?
– What are the network protocols? Frequency, volume, etc.
■ Is the data structured or unstructured?
– Are the formats stable?
■ Do you understand the data life cycle of your data?
– Retention policies, privacy, access control
■ Performance & Availability
– How quickly are results required?
– What are the tolerances for system failure? SLAs?
11. UC1-LOG – Server log analysis
■ Analysis of data-at-rest
■ Massive volumes of logs are captured by syslog and require aggregation by hour/day
■ Results are produced every hour
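As a rough illustration of the hour/day aggregation step (the log lines and syslog-style format below are illustrative assumptions, not the actual pipeline), a sketch might look like:

```python
from collections import Counter

# Hypothetical syslog-style lines: "<month> <day> <HH:MM:SS> <host> <msg>"
LOG_LINES = [
    "Nov 03 10:15:01 web01 GET /index.html 200",
    "Nov 03 10:47:12 web01 GET /about.html 404",
    "Nov 03 11:02:33 web02 GET /index.html 200",
]

def hour_bucket(line):
    """Extract an hour-level key such as "Nov 03 10" from a log line."""
    month, day, timestamp = line.split()[:3]
    return f"{month} {day} {timestamp.split(':')[0]}"

# Aggregate: count events per hour bucket
counts = Counter(hour_bucket(line) for line in LOG_LINES)
print(counts)  # Counter({'Nov 03 10': 2, 'Nov 03 11': 1})
```

At cluster scale the same grouping runs as a distributed job, but the key-extraction and counting logic is the same.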
12. UC2-MSTR – Monitoring Streaming Events
■ Analysis of data-in-motion
■ Extends UC1-LOG by requiring that certain log events be used to notify of service impact
– I.e. generate actionable events from real-time (RT) analysis of the service logs
■ Results are produced continuously
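The "actionable events from a log stream" idea can be sketched as a simple severity filter (the severity levels and alert rule are illustrative assumptions):

```python
def alerts(events, threshold="ERROR"):
    """Yield an alert for each event at or above the given severity."""
    ranking = {"INFO": 0, "WARN": 1, "ERROR": 2}
    for severity, message in events:
        if ranking[severity] >= ranking[threshold]:
            yield f"ALERT: {message}"

# A generator consumes the stream incrementally, one event at a time
stream = [("INFO", "heartbeat ok"), ("ERROR", "disk full on web01"), ("WARN", "slow query")]
print(list(alerts(stream)))  # ['ALERT: disk full on web01']
```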
13. UC3-PPD – Publishing Production Data
■ Refining data for hosting by services
■ Datasets are massaged and published for consumption by customer facing services
■ Data is merged, refined and published for service hosts to consume
■ Process repeats on demand to accommodate new datasets
20. File transfer to HDFS
■ Simple file loads, via the following techniques:
– Explicit loading via HDFS commands, e.g.
■ hadoop fs -put <file>
– Mounting HDFS as a FUSE-enabled filesystem
■ Note that the filesystem supports append-only writes
■ Note - Manually loaded filesets require manual tracking and clean-up
21. DB Exchange with Hadoop - Apache Sqoop
■ Apache Sqoop is a tool for transferring data between Hadoop and relational
databases. Use Sqoop to import data from a MySQL or Oracle database into HDFS,
run MapReduce on the data, and then export the data back into an RDBMS. Sqoop
automates these processes, using MapReduce to import and export the data in
parallel with fault-tolerance
■ It offers two-way replication with both snapshots and incremental updates.
■ Note - Sqoop requires detailed schema knowledge and coordinated configuration of database accounts. The source database must be up at the time of the Sqoop run.
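Sqoop itself needs a live database and a cluster, but its core trick — splitting an import into key ranges that parallel mappers fetch independently — can be sketched with SQLite standing in for the RDBMS (table and ranges below are illustrative):

```python
import sqlite3

# Stand-in source database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1, 9)])

def split_ranges(lo, hi, num_mappers):
    """Divide [lo, hi] into contiguous ranges, one per parallel mapper."""
    step = (hi - lo + 1) // num_mappers
    bounds = [lo + step * i for i in range(num_mappers)] + [hi + 1]
    return list(zip(bounds[:-1], bounds[1:]))

# Each "mapper" imports its own slice of the table independently
slices = [conn.execute("SELECT id, total FROM orders WHERE id >= ? AND id < ?",
                       rng).fetchall()
          for rng in split_ranges(1, 8, 4)]

print([len(s) for s in slices])  # [2, 2, 2, 2]
```

This is why Sqoop needs schema knowledge: it must pick a split column (usually the primary key) to partition the import.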
22. Log Collection – Apache Flume
■ Flume is a distributed system for collecting log data from many sources, aggregating it, and writing it to HDFS. It is designed to be reliable and highly available, while providing a simple, flexible, and intuitive programming model based on streaming data flows. Flume provides extensibility for online analytic applications that process data streams in situ.
■ Maintains a central list of ongoing data flows, stored redundantly in ZooKeeper
■ See : http://www.lopakalogic.com/articles/hadoop-articles/log-files-flume-hive/
23. Queue Ingest - Apache Kafka
■ Apache Kafka is a fast, distributed publish-subscribe messaging system. It is
designed to provide high throughput persistent messaging that’s scalable and
allows for parallel data loads into Hadoop. Its features include the use of
compression to optimize IO performance and mirroring to improve availability,
scalability and to optimize performance in multiple-cluster scenarios.
– Queues decouple systems: both statically and in time
■ See http://www.slideshare.net/gwenshap/kafka-for-dbas
■ Note - Kafka can buffer source data, so availability requirements on the Hadoop platform can be relaxed
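The decoupling-in-time point can be sketched with an in-process queue (a stand-in for a Kafka topic, not the Kafka API): the producer finishes completely before the consumer starts, yet nothing is lost.

```python
import queue
import threading

buffer = queue.Queue()  # stands in for a durable Kafka topic

def produce():
    for i in range(5):
        buffer.put(f"event-{i}")
    buffer.put(None)  # sentinel: end of stream

producer = threading.Thread(target=produce)
producer.start()
producer.join()  # producer is done; the "Hadoop side" hasn't even started

# The consumer drains the buffered events later, at its own pace
consumed = []
while (item := buffer.get()) is not None:
    consumed.append(item)

print(consumed)  # ['event-0', 'event-1', 'event-2', 'event-3', 'event-4']
```

With Kafka the buffer is persistent and distributed, so the consumer can even be a cluster that was down while the events arrived.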
24. RT Streaming – Storm or Spark
■ Both Storm and Spark Streaming are open-source frameworks for distributed
stream processing
■ Processing Model, Latency
– Storm processes incoming events one at a time, in real time
– Spark Streaming batches up events that arrive within a short time window before processing them, with several seconds of latency
■ Fault Tolerance, Data Guarantees
– Storm tracks individual records and guarantees that each record will be
processed at least once, but allows duplicates
– Spark Streaming provides better support for fault-tolerant stateful computation
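The processing-model contrast can be sketched on a list of timestamped events (the events, the ×10 transform, and the 2-second window are illustrative assumptions):

```python
# Events are (seconds, value) pairs
events = [(0.1, 1), (0.9, 2), (2.3, 3), (3.8, 4), (4.1, 5)]

# Storm-style: handle each record individually as it arrives
per_event = [value * 10 for _, value in events]

# Spark-Streaming-style: group records into fixed micro-batches, then process
def micro_batches(events, window=2.0):
    batches = {}
    for t, value in events:
        batches.setdefault(int(t // window), []).append(value)
    return [batch for _, batch in sorted(batches.items())]

print(per_event)              # [10, 20, 30, 40, 50]
print(micro_batches(events))  # [[1, 2], [3, 4], [5]]
```

The micro-batch output makes the latency trade-off visible: the value arriving at t=0.1 is not processed until its 2-second window closes.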
25. Lambda Architecture
■ Batch Layer, which has all the processed batch data from the past
■ Speed Layer or RT feed of similar or same information
■ Serving layer combines the two for transparent access
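The three layers above amount to a merge at read time; a minimal sketch (the page names and counts are illustrative):

```python
batch_view = {"page_a": 100, "page_b": 40}  # precomputed by the batch layer
speed_view = {"page_a": 3, "page_c": 1}     # fresh increments from the speed layer

def serve(key):
    """Serving layer: combine both views transparently for the reader."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

print(serve("page_a"))  # 103 -- batch total plus recent increments
print(serve("page_c"))  # 1   -- so new, it only exists in the speed layer
```

When the next batch run completes, its view absorbs the recent data and the speed layer's entries are discarded.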
27. Java MR (Map/Reduce)
■ Map/Reduce functionality is accessible via Java.
■ Full applications can be developed, although higher-level constructs such as Pig and Hive should also be considered, as traditional Java development cycles are usually longer than the scripting routes.
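The slide names Java, but the map/shuffle/reduce phases themselves can be sketched in a few lines of Python (word count is the assumed example, and the "shuffle" is simulated with a local sort/group):

```python
from itertools import groupby
from operator import itemgetter

docs = ["big data on hadoop", "big clusters on hadoop"]

# Map: emit (key, 1) pairs
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle: sort/group by key (the framework does this between phases)
mapped.sort(key=itemgetter(0))

# Reduce: sum the values for each key
word_counts = {key: sum(v for _, v in group)
               for key, group in groupby(mapped, key=itemgetter(0))}

print(word_counts)  # {'big': 2, 'clusters': 1, 'data': 1, 'hadoop': 2, 'on': 2}
```

A Java MapReduce job expresses the same two functions as `Mapper` and `Reducer` classes, which is where the longer development cycle comes from.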
28. PIG for transforming unstructured data
■ Pig provides a scripting language (Pig Latin) for processing unstructured datasets ("pigs eat anything")
– Contrast with Hive
■ Pig Latin programs run in a distributed fashion on a cluster (programs are compiled into Map/Reduce jobs and executed using Hadoop).
29. Apache Hive
■ Provides SQL-like access to structured HDFS datasets
– Contrast with Pig
■ Queries are converted to internal M/R, Tez, or Spark jobs (similar to Pig)
■ Indexing is supported
■ CRUD support with ACID functionality was added
■ Query language can be extended with User Defined Functions (UDFs)
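Hive itself needs a cluster, but the UDF idea — extending a SQL dialect with your own function and calling it like a built-in — can be illustrated with SQLite's `create_function` (the table and the `to_kb` function are illustrative, and SQLite's registration API differs from Hive's):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (host TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("web01", 2048), ("web02", 4096)])

# Register a scalar UDF, then call it from SQL just like a built-in
conn.create_function("to_kb", 1, lambda n: n // 1024)
rows = conn.execute("SELECT host, to_kb(bytes) FROM logs ORDER BY host").fetchall()
print(rows)  # [('web01', 2), ('web02', 4)]
```

In Hive the equivalent is a Java class extending `UDF`, registered with `CREATE FUNCTION`, but the query-side experience is the same.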
30. Mahout – Machine Learning
■ Mahout is a library of scalable machine-learning algorithms, implemented on top of
Apache Hadoop® and using the MapReduce paradigm.
■ Mahout supports four main data science use cases:
– Collaborative filtering – mines user behavior and makes product
recommendations (e.g. Amazon recommendations)
– Clustering – takes items in a particular class (such as web pages or
newspaper articles) and organizes them into naturally occurring groups, such
that items belonging to the same group are similar to each other
– Classification – learns from existing categorizations and then assigns
unclassified items to the best category
– Frequent itemset mining – analyzes items in a group (e.g. items in a
shopping cart or terms in a query session) and then identifies which items
typically appear together
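The collaborative-filtering case above can be sketched with item co-occurrence counts (the baskets below are illustrative stand-ins for user purchase histories, and Mahout's actual algorithms are more sophisticated):

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"book", "lamp"},
    {"book", "lamp", "desk"},
    {"desk", "lamp"},
]

# Count how often each pair of items appears together
cooccur = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        cooccur[(a, b)] += 1

def recommend(item):
    """Rank other items by how often they co-occur with the given one."""
    scores = Counter()
    for (a, b), n in cooccur.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common()]

print(recommend("book"))  # ['lamp', 'desk'] -- lamp co-occurs with book more often
```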
32. HBase – Hadoop's NoSQL DB
■ HBase provides near real-time, random read and write access to tables (or to be
more accurate ‘maps’) storing billions of rows and millions of columns.
– Contrast with Cassandra
■ HBase runs on the Hadoop cluster without the need for additional cluster deployments
■ Access is dependent on the availability of the cluster
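The "map, not table" point can be sketched as a sparse, sorted row → column → value structure (the row keys and `family:qualifier` column names below are illustrative; real HBase also versions each cell by timestamp):

```python
table = {}  # sparse map of maps: row -> column -> value

def put(row, column, value):
    table.setdefault(row, {})[column] = value

def get(row, column):
    return table.get(row, {}).get(column)

put("user#42", "info:name", "Ada")
put("user#42", "info:city", "Waterloo")
put("user#99", "info:name", "Grace")

print(get("user#42", "info:city"))  # Waterloo
# Rows are kept sorted by key, which is what enables efficient range scans
print(sorted(table))  # ['user#42', 'user#99']
```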
33. Apache Cassandra – distributed DB
■ Apache Cassandra is a massively scalable open source non-relational database that
offers continuous availability, linear scale performance, operational simplicity and
easy data distribution across multiple data centers and cloud availability zones.
– Contrast with HBase
■ Hadoop deployments are tied to the data center. Replicate the results to multiple
sites via the eventual consistency of Cassandra.