Building Scalable Big Data Pipelines
 


Talk held at the NoSQL Roadshow on 19.09.2013 in Zurich.



Usage Rights

© All Rights Reserved

    Building Scalable Big Data Pipelines: Presentation Transcript

    • Building Scalable Big Data Pipelines
      • Christian Gügi, Solution Architect
      • 19.09.2013, NoSQL Search Roadshow Zurich
    • AGENDA
      • Opportunities & Challenges
      • Integrating Hadoop
      • Lambda Architecture
      • Lambda in Practice
      • Recommendations
    • ABOUT ME
      • Solution Architect @ YMC
      • Founder and organizer, Swiss Big Data User Group
        • http://www.bigdata-usergroup.ch/
      • Contact
        • christian.guegi@ymc.ch
        • http://about.me/cguegi
        • @chrisgugi
    • ABOUT YMC
      • Founded in 2001
      • Based in Kreuzlingen, Switzerland
      • Big Data Analytics, Web Solutions and Mobile Applications
      • 24 experts
        • Consulting, creation, engineering
    • OPPORTUNITIES & CHALLENGES
    • BIG DATA – WHAT IS THE BIG DEAL?
      • A. New sources and types of data from inside & outside organisations
        • “Internet of things”, sensors, RFID, intelligent devices, etc.
        • Unstructured information: documents, web logs, email, social media, etc.
        • Trusted 3rd-party sources: industry providers & aggregators, governments (“Open Data”), weather, etc.
      • B. Technology innovations to exploit the new world of data
        • Low-cost storage and processing power (cloud, on-premise & hybrid)
        • New software patterns to handle speed & volume, structured and unstructured data (in-memory computation, Hadoop, MapReduce, etc.)
        • Revolution in user experience, analytics, recommendations
    • BIG DATA – CHALLENGES
      • Challenges: align business strategy; data management; privacy protection; lack of skilled and experienced people; volume, velocity, variety, veracity
      • Areas: character of data; overwhelming landscape & integration; available talent; organisational issues
    • INTEGRATING HADOOP
    • TYPICAL RDBMS SCENARIO
      • Data sources: RDBMS, RDBMS, NFS, others
      • ETL
      • Data systems: DWH
      • Apps: BI, Web, Mobile
    • BIG DATA SCENARIO
      • Data sources: RDBMS, RDBMS, NFS, logs, social media, sensors
      • Data systems: DWH, Hadoop 1)
      • Apps: BI, Web, Mobile
      • 1) Recommendations, etc.
    • HADOOP ECOSYSTEM
    • LAMBDA ARCHITECTURE
    • LAMBDA ARCHITECTURE
      • Credits: Nathan Marz
      • Former engineer at Twitter
      • Storm, Cascalog, ElephantDB
      • http://www.manning.com/marz/
    • DESIGN PRINCIPLES (Lambda Architecture)
      • Human fault-tolerance
      • Data immutability
      • Re-computation
    • HUMAN FAULT-TOLERANCE (Lambda Architecture)
      • Design for human error
        • Bugs in code
        • Accidental data loss
        • Data corruption
      • Protect good data, so you can always fix what went wrong
    • DATA IMMUTABILITY (Lambda Architecture)
      • Store data in its rawest form
      • Create and read, but no update
      • No data can be lost
        • To fix the system, just delete bad data
        • Can always revert to a true state
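The "create and read but no update" rule, and the repair path of deleting bad data, can be sketched roughly as follows. This is a minimal illustration and not code from the talk: the `Event` record, its `batch_id` field, the `AppendOnlyStore` class and the click counts are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)  # frozen: a stored record can never be modified in place
class Event:
    batch_id: str   # hypothetical: identifies the import run that wrote the record
    user: str
    clicks: int


class AppendOnlyStore:
    """Master dataset that supports create and read, but no update."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, events: Iterable[Event]) -> None:
        self._events.extend(events)

    def read(self) -> Tuple[Event, ...]:
        # Return a tuple so callers cannot mutate the stored list.
        return tuple(self._events)

    def drop_batch(self, batch_id: str) -> None:
        # The only repair operation: remove records written by a bad run.
        self._events = [e for e in self._events if e.batch_id != batch_id]


def clicks_per_user(events: Iterable[Event]) -> dict:
    """Derived view, always computed from the stored records."""
    totals: dict = {}
    for e in events:
        totals[e.user] = totals.get(e.user, 0) + e.clicks
    return totals


store = AppendOnlyStore()
store.append([Event("run-1", "alice", 3), Event("run-1", "bob", 1)])
store.append([Event("run-2", "alice", -999)])  # a buggy job wrote corrupt data

store.drop_batch("run-2")                      # delete the bad data ...
view = clicks_per_user(store.read())           # ... and recompute the view
```

Because the good records were never overwritten, removing the faulty run and recomputing is enough to get back to a true state.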
    • DATA IMMUTABILITY (Lambda Architecture)

      Capturing change traditionally (mutability)

        Before:
        Name   Location
        Alice  Zurich
        Bob    Lucerne
        Tom    Bern

        After:
        Name   Location
        Alice  Basel
        Bob    Lucerne
        Tom    Bern

      Capturing change (immutability)

        Before:
        Name   Location  Time
        Alice  Zurich    2009/03/29
        Bob    Lucerne   2012/04/12
        Tom    Bern      2010/04/09

        After:
        Name   Location  Time
        Alice  Zurich    2009/03/29
        Bob    Lucerne   2012/04/12
        Tom    Bern      2010/04/09
        Alice  Basel     2013/08/20
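The two ways of capturing Alice's move can be compared in a few lines. The sketch below reuses the names, places and dates from the slide; the helper `location_as_of` and the query dates are assumptions added for illustration.

```python
from datetime import date

# Mutable model: one row per person, overwritten on change.
mutable = {"Alice": "Zurich", "Bob": "Lucerne", "Tom": "Bern"}
mutable["Alice"] = "Basel"   # Alice's time in Zurich is no longer recoverable

# Immutable model: every observation is a timestamped fact; changes are appended.
facts = [
    ("Alice", "Zurich", date(2009, 3, 29)),
    ("Bob", "Lucerne", date(2012, 4, 12)),
    ("Tom", "Bern", date(2010, 4, 9)),
]
facts.append(("Alice", "Basel", date(2013, 8, 20)))


def location_as_of(facts, name, when):
    """Return the most recent known location of `name` on or before `when`."""
    known = [(t, loc) for (n, loc, t) in facts if n == name and t <= when]
    return max(known)[1] if known else None


current = location_as_of(facts, "Alice", date(2013, 9, 19))   # "Basel"
earlier = location_as_of(facts, "Alice", date(2010, 1, 1))    # "Zurich"
```

The current location is derived from the facts instead of being stored, so earlier states remain queryable.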
    • RE-COMPUTATION (Lambda Architecture)
      • Always able to re-compute from historical data
      • Basis for all data systems
        • query = function(all data)
      • [Diagram: All Data → Pre-computed views → Query]
    • LAYERS (Lambda Architecture)
      • http://www.ymc.ch/en/lambda-architecture-part-1
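The idea "query = function(all data)", served through pre-computed views, might look like this in miniature. The page-view records and function names are invented for the example; the point is only that a lookup in the pre-computed view returns the same answer as scanning all data.

```python
from collections import Counter

# Hypothetical raw page-view records: (url, day)
ALL_DATA = [
    ("/home", "2013-09-17"),
    ("/home", "2013-09-18"),
    ("/pricing", "2013-09-18"),
    ("/home", "2013-09-18"),
]


def views_direct(all_data, url):
    """query = function(all data): scan everything on each request."""
    return sum(1 for (u, _day) in all_data if u == url)


def build_view(all_data):
    """Batch step: pre-compute view counts per URL from the full dataset."""
    return Counter(u for (u, _day) in all_data)


def views_from_view(view, url):
    """Query step: cheap lookup in the pre-computed view."""
    return view.get(url, 0)


view = build_view(ALL_DATA)
home_views = views_from_view(view, "/home")
```

If the view logic turns out to be wrong, `build_view` can simply be fixed and run again over the historical data.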
    • Lambda in Practice
    • ONLINE MARKETING
      • Tracking and analytics solution
      • Improve customer targeting and segmentation
      • Various reports
      • Real-time not required
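    • OVERVIEW
      • [Architecture diagram; only the labels survive]
      • Sources: FTP, AdServer, Campaign Database
      • Ingestion: Flume, Sqoop, fs -put, Up- & Download
      • Data formats: log, csv
      • Storage and processing: HDFS, HBase, Hive, Pig, Impala
      • Outputs: Aggregated Data, Web, DWH, BI apps
      • Coordination and management: Oozie, ZooKeeper, Cloudera Manager
    • DATA PIPELINE
      • [Pipeline diagram; only the labels survive]
      • Sources: FTP, AdServer, Campaign Database
      • Ingestion into HDFS: Flume, Sqoop, fs -put (formats: log, csv)
      • Stages: Extracting (M/R), Transformation (M/R), Loading (M/R), with Avro data between stages
      • Outputs: Tracking, Profiles, Bulk Importer, DWH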
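A toy version of the extract, transform and load stages shows how decoupled steps can be replayed. This is only a sketch under several assumptions: a Python dict stands in for HDFS directories, JSON lines stand in for the Avro files of the real pipeline, plain functions stand in for the M/R jobs, and the log format and field names are invented.

```python
import json

# A dict stands in for HDFS directories; each stage writes to its own key.
storage = {
    "raw/adserver": [
        "2013-09-18T10:00:00 user=u1 campaign=c7 action=click",
        "2013-09-18T10:00:05 user=u2 campaign=c7 action=view",
        "malformed line",
    ],
}


def extract(storage):
    """Parse raw log lines into records; skip lines with an unexpected field count."""
    records = []
    for line in storage["raw/adserver"]:
        parts = line.split()
        if len(parts) != 4:
            continue
        fields = dict(p.split("=", 1) for p in parts[1:])
        records.append({"ts": parts[0], **fields})
    storage["extracted"] = [json.dumps(r, sort_keys=True) for r in records]


def transform(storage):
    """Aggregate extracted records into counts per (campaign, action)."""
    counts = {}
    for raw in storage["extracted"]:
        r = json.loads(raw)
        key = (r["campaign"], r["action"])
        counts[key] = counts.get(key, 0) + 1
    storage["transformed"] = [
        {"campaign": c, "action": a, "count": n}
        for (c, a), n in sorted(counts.items())
    ]


def load(storage, warehouse):
    """Replace the warehouse table with the transformed rows (safe to repeat)."""
    warehouse["campaign_stats"] = list(storage["transformed"])


warehouse = {}
extract(storage)
transform(storage)
load(storage, warehouse)

# Replaying a single stage reuses the previous stage's stored output.
transform(storage)
load(storage, warehouse)
```

Each stage reads only what the previous stage persisted and overwrites its own output, so any step can be rerun on its own after a fix.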
    • ADVANTAGES
      • Extensible: easily add a speed layer later on
      • Complements existing DWH/BI system
      • ETL phases are decoupled
      • Reliable
        • Infrastructure
        • Each step can be replayed
      • Scalable
        • Storage
        • Processing
      • Highly available
      • Ad-hoc analysis right from the beginning
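Adding a speed layer later could look roughly like the sketch below: the batch view stays as it is, and a small incremental view covers events that arrived after the last batch run. The class and function names, and the campaign ids, are hypothetical; the project described in the talk did not need real-time results.

```python
from collections import Counter


def batch_view(master_data):
    """Batch layer: counts recomputed from the full master dataset."""
    return Counter(master_data)


class SpeedView:
    """Speed layer: incremental counts for events the last batch run has not seen."""

    def __init__(self):
        self.counts = Counter()

    def update(self, campaign):
        self.counts[campaign] += 1

    def reset(self):
        # Called once a new batch run has absorbed these events.
        self.counts.clear()


def query(batch, speed, campaign):
    """Serving side: merge both views at read time."""
    return batch.get(campaign, 0) + speed.counts.get(campaign, 0)


master = ["c7", "c7", "c9"]   # events already in the master dataset
batch = batch_view(master)

speed = SpeedView()
speed.update("c7")            # event arrives after the batch run

total_before = query(batch, speed, "c7")   # batch count plus speed count

# The next batch run includes the new event, so the speed view is discarded.
master.append("c7")
batch = batch_view(master)
speed.reset()
total_after = query(batch, speed, "c7")
```

The query result is the same before and after the batch run catches up, which is why the speed layer can be bolted on without changing the batch pipeline.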
    • RECOMMENDATIONS
    • RECOMMENDATIONS
      • Not a fixed, one-size-fits-all approach
        • Adapt to your needs/requirements
      • Hadoop complements existing systems
      • How real-time do I need to be?
      • Immutability and pre-computation are just good ideas!
        • Store information in the rawest format possible
        • Use a serialization framework (Avro, Thrift, Protocol Buffers)
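To make the serialization recommendation concrete, here is what an Avro record schema for a tracking event might look like. Avro schemas are written in JSON; the record name, namespace and fields below are invented for this example. The `conforms` helper is only a stand-in written with the standard library to show what a schema buys you; a real Avro library performs this validation (and the binary encoding) itself.

```python
import json

# Hypothetical Avro record schema for one tracking event.
TRACKING_EVENT_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "TrackingEvent",
  "namespace": "example.tracking",
  "fields": [
    {"name": "timestamp", "type": "long"},
    {"name": "user_id",   "type": "string"},
    {"name": "campaign",  "type": "string"},
    {"name": "action",    "type": "string"}
  ]
}
""")

# Stand-in mapping from the two Avro primitive types used above to Python types.
PYTHON_TYPES = {"long": int, "string": str}


def conforms(record, schema):
    """Return True if `record` has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(isinstance(record[f["name"]], PYTHON_TYPES[f["type"]]) for f in fields)


good = {"timestamp": 1379584800, "user_id": "u1", "campaign": "c7", "action": "click"}
bad = {"timestamp": "yesterday", "user_id": "u1", "campaign": "c7", "action": "click"}

good_ok = conforms(good, TRACKING_EVENT_SCHEMA)
bad_ok = conforms(bad, TRACKING_EVENT_SCHEMA)
```

Storing raw events against an explicit schema means malformed records are rejected when they are written, not discovered later during analysis.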
    • THANK YOU!
    • CONTACT US
      • YMC AG, Sonnenstrasse 4, CH-8280 Kreuzlingen, Switzerland
      • christian.guegi@ymc.ch
      • @chrisgugi
      • Tel. +41 (0)71 508 24 76
      • www.ymc.ch
      • Photo credits
        • Slide 05: Success opportunity achieve by Stephen McCulloch
        • Slide 08: Matrix by Gamaliel Espinoza Macedo
        • Slide 12: Layers by Katelyn Leblanc
        • Slide 20: Mining For Information by JD Hancock
        • Slide 27: Warning Question by longzijun