
Introduction to Big Data Pipelining, Berlin Meetup, 2016-03-31 (v0.3)

An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines

  1. Simon Ambridge. Data Pipelines With Spark & DSE: An Introduction To Building Agile, Flexible and Scalable Big Data and Data Science Pipelines
  2. Simon Ambridge, Pre-Sales Solution Engineer, DataStax UK. Certified Apache Cassandra and DataStax enthusiast who enjoys explaining that traditional approaches to data management just don't cut it anymore in the new always-on, no-single-point-of-failure, high-volume, high-velocity, real-time distributed data management world. Previously 25 years implementing Oracle relational data management solutions; certified in Exadata, Oracle Cloud, Oracle Essbase, Oracle Linux and OBIEE. simon.ambridge@datastax.com @stratman1958
  3. Big Data Pipelining: Why Analytics? A recent survey found that more than half of respondents wanted to: • React to customers faster and with more accuracy • Reduce business risk through a more accurate understanding of the market • Optimise return on marketing investment via better-targeted campaigns • Get to market faster, with the right products at the right time • Improve efficiency – commerce, plant and people. 85% wanted analytics to handle ‘real-time’ data changing at <1s intervals
  4. Big Data Pipelining: Classification. Big Data pipelines can mean different things to different people. Big, static data: repeated analysis on a static but massive dataset • Typically an element of research, e.g. genomics, clinical trials, demographic data • Typically repetitive, iterative and shared amongst data scientists for analysis. Fast, streaming data: real-time analytics on streaming data • Typically an industrialised process, e.g. sensors, tick data, bioinformatics, transactional data, real-time personalisation • Something happening in real time that usually cannot be dropped or lost
  5. Static Datasets: All You Can Eat? Really.
  6. Static Analytics: Traditional Approach. Typical traditional ‘static’ data analysis model: Data → Sampling → Modeling → Tuning → Interpret → Reporting → Re-sample → Results. Repeated iterations at each stage; the run/debug cycle can be slow.
  7. Analytics: Traditional Scaling. As datasets grow, the servers grow with them: small datasets, small servers → large datasets, large servers → big datasets, big servers.
  8. Static Analytics: Scale Up Challenges • Sampling and analysis often run on a single machine • CPU and memory limitations – finite resources on a single machine • Offers only limited sampling of a large dataset because of data size limitations • Multiple iterations over large datasets are frequently not an ideal approach
  9. Static Analytics: Big Data Problems • Data really is getting Big! • Data is getting bigger! • The number of data sources is exploding! • More data is arriving, faster! Scaling up is becoming impractical – physical limits • The analysis becomes obsolete faster • Analysis is too slow to get any real ROI from the data
  10. Big Data Analytics: Big Data Needs. We need scalable infrastructure + distributed technologies • Data volumes can be scaled – we can distribute the data across multiple low-cost machines or cloud instances • Faster processing – smaller, distributed datasets • More complex processing – distributed across multiple machines • No single point of failure
  11. Big Data Analytics: DSE Delivers. Building a distributed data processing framework can be a complex task! It needs to: • Be scalable • Have fast in-memory processing • Handle real-time or streaming data feeds • Handle high throughput and low latency • Ideally handle ad-hoc queries • Ideally be replicated across multiple data centres for resiliency
  12. DataStax Enterprise: Standard Edition • Certified Cassandra – delivers trusted, tested and certified versions of Cassandra ready for production environments. • Expert Support – answers and assistance from the Cassandra experts for all production needs. • Enterprise Security – supplies full protection for sensitive data. • Automatic Management Services – automates key maintenance functions to keep the database running smoothly. • OpsCenter – provides advanced management and monitoring functionality for production applications.
  13. DataStax Enterprise: Max Edition • Advanced Analytics – provides the ability to run real-time and batch analytic operations on Cassandra data, as well as integrate DSE with external Hadoop deployments. • Enterprise Search – supplies built-in enterprise and distributed search capabilities on Cassandra data. • In-Memory Option – delivers all the benefits of Cassandra to in-memory computing. • Workload Isolation – allows analytics and search functions to run separately from transactional workloads, with no need to ETL data to different systems.
  14. Intro To Cassandra: THE Cloud Database. What is Apache Cassandra? • Originally started at Facebook in 2008 • Top-level Apache project since 2010 • Open-source distributed database • Clusters can handle large amounts of data (PBs) • Performant at high velocity • Extremely resilient: across multiple data centres, no single point of failure • Continuous availability, disaster avoidance • Enterprise Cassandra platform from DataStax
  15. Intro To Spark: THE Analytics Engine. What is Apache Spark? • Started at UC Berkeley in 2009 • Apache project since 2010 • Distributed in-memory processing • Rich Scala, Java and Python APIs • Fast - 10x-100x faster than Hadoop MapReduce • 2x-5x less code than R • Batch and streaming analytics • Interactive shell (REPL) • Tightly integrated with DSE
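     For illustration, a minimal sketch of the kind of interactive analysis the Spark shell (REPL) allows. The file path is hypothetical; sc is the SparkContext the shell provides.

        // In the Spark shell a SparkContext is pre-created as `sc`.
        // Hypothetical input path; any text file reachable by the cluster would do.
        val lines = sc.textFile("hdfs:///data/events.log")

        // Classic distributed word count: split, map to (word, 1), reduce by key.
        val counts = lines.flatMap(_.split("\\s+"))
                          .map(word => (word, 1))
                          .reduceByKey(_ + _)

        // Bring the 10 most frequent words back to the driver and print them.
        counts.sortBy(-_._2).take(10).foreach(println)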
  16. Spark: Daytona GraySort Contest, October 2014. The Daytona GraySort benchmark tests how fast a system can sort 100 TB of data (1 trillion records) • The previous world record was held by a Hadoop MapReduce cluster of 2,100 nodes, in 72 minutes • Spark completed the benchmark in 23 minutes on just 206 EC2 nodes. All the sorting took place on disk (HDFS), without using Spark’s in-memory cache (3x faster using 10x fewer machines) • Spark also sorted 1 PB (10 trillion records) on 190 machines in under 4 hours, beating the previous Hadoop MapReduce result of 16 hours on 3,800 machines (4x faster using 20x fewer machines)
  17. DataStax Enterprise: Analytics Integration • Apache Cassandra for distributed persistent storage • Integrated Apache Spark for distributed real-time analytics • Analytics nodes close to the data – no ETL required. A separate Spark cluster fed by ETL from the Cassandra cluster means loose integration, data separate from processing and millisecond response times; the integrated approach gives tight integration, data locality and microsecond response times. “Latency when transferring data is unavoidable. The trick is to reduce the latency to as close to zero as possible…”
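     As a sketch of what that tight integration looks like in practice, a hypothetical read-transform-write using the DataStax Spark-Cassandra connector from the DSE Spark shell. Keyspace, table and column names are invented for the example.

        import com.datastax.spark.connector._   // DataStax Spark-Cassandra connector

        // Hypothetical keyspace/table; reads are distributed and data-local,
        // each Spark worker scanning the token ranges owned by its co-located Cassandra node.
        val readings = sc.cassandraTable("sensors", "readings")

        // Simple aggregation in Spark, result written straight back to Cassandra (no ETL hop).
        val perSensor = readings
          .map(row => (row.getString("sensor_id"), row.getDouble("value")))
          .reduceByKey(_ + _)

        perSensor.saveToCassandra("sensors", "totals", SomeColumns("sensor_id", "total"))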
  18. Intro To Parquet: Quick History. What is Parquet? • Started at Twitter and Cloudera in 2013 • Databases traditionally store information in rows and are optimized for working with one record at a time • Columnar storage systems are optimised to store data by column • A compressed, efficient columnar data representation • Compression schemes can be specified on a per-column level • Allows complex data to be encoded efficiently • Netflix – 7 PB of warehoused data in Parquet format • Not as compressed as ORC (Hortonworks) but faster to read and analyse
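     A small sketch of using Parquet from Spark SQL. The paths, the JSON source format and the snappy codec choice are assumptions made for the example.

        import org.apache.spark.sql.SQLContext

        val sqlContext = new SQLContext(sc)

        // Columnar compression codec for Parquet output (snappy, gzip, lzo, ...).
        sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")

        // Hypothetical paths: convert a raw extract into Parquet once,
        // then all later analysis reads only the columns it needs.
        val df = sqlContext.read.json("hdfs:///raw/trials.json")
        df.write.parquet("hdfs:///warehouse/trials.parquet")

        // Reading back: only the referenced columns are scanned and decompressed.
        val trials = sqlContext.read.parquet("hdfs:///warehouse/trials.parquet")
        trials.select("patient_id", "dosage").show(5)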
  19. Intro To Akka: Distributed Apps. What is Akka? • Open source toolkit first released in 2009 • Simplifies the construction of highly concurrent and distributed Java apps • Makes it easier to build concurrent, fault-tolerant and scalable applications • Based on the ‘actor’ model • Highly performant event-driven programming • Hierarchical – each actor is created and supervised by its parent • Process failures treated as events handled by an actor's supervisor • Language bindings exist for both Java and Scala
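     A minimal, hypothetical actor to illustrate the model: the actor system supervises the Counter actor, and messages are delivered asynchronously and processed one at a time as events.

        import akka.actor.{Actor, ActorSystem, Props}

        // A hypothetical actor that counts the messages it receives.
        class Counter extends Actor {
          var count = 0
          def receive = {
            case msg: String =>
              count += 1
              println(s"message $count: $msg")
          }
        }

        object CounterApp extends App {
          val system  = ActorSystem("pipeline")                     // root of the actor hierarchy
          val counter = system.actorOf(Props[Counter], "counter")   // child, supervised by its parent

          // Fire-and-forget message sends; the actor processes them sequentially.
          counter ! "sensor reading received"
          counter ! "sensor reading parsed"

          system.terminate()   // shut the actor system down (Akka 2.4+)
        }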
  20. Big Data Pipelining: Static Datasets. Valid data pipeline analysis methods must be: • Auditable • Reproducible – essential for any science, so too for Data Science • Documented – important to understand the how and why • Controlled – suitable for version control • Collaborative – easily accessible
  21. Intro To Notebooks: Features. What are Notebooks? • Drive your data analysis from the browser • Increasingly popular • Highly interactive • Tight integration with Apache Spark • Handy tools for analysts: • Reproducible visual analysis • Code in Scala, CQL, SparkSQL • Charting – pie, bar, line etc. • Extensible with custom libraries
  22. Intro To Notebooks: Features
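     As a sketch of the notebook features above, two hypothetical cells (one Scala, one SparkSQL) over an invented Cassandra table. sc and sqlContext are assumed to be provided by the notebook environment.

        // Cell 1 - Scala: load Cassandra data and expose it to SQL (names are hypothetical)
        import com.datastax.spark.connector._
        import sqlContext.implicits._        // sqlContext is provided by the notebook

        val trades = sc.cassandraTable("finance", "trades")
          .map(r => (r.getString("symbol"), r.getDouble("price")))
          .toDF("symbol", "price")
        trades.registerTempTable("trades")

        // Cell 2 - SparkSQL: aggregate; the notebook can chart the result as a bar or pie chart
        sqlContext.sql("SELECT symbol, avg(price) AS avg_price FROM trades GROUP BY symbol").show()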
  23. Big Data Pipelining: Static Datasets. Example architecture & requirements: 1. Optimised source data format 2. Distributed in-memory analytics 3. Interactive and flexible data analysis tool 4. Persistent data store 5. Visualisation tools
  24. Big Data Pipelining: Pipeline Flow. Example: genome research platform (SHAR3) – ADAM → Notebook → Datastore → Notebook → Visualisation
  25. Big Data Pipelining: Pipeline Scalability • Add more (physical or virtual) nodes as required to add capacity • Container tools can ease configuration management • Scale out quickly
  26. Big Data Pipelining: Pipeline Process Flow. 1. Source data 2. Interactive, flexible and reproducible analysis 3. Persistent data storage 4. Visualise and analyse
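     A compressed sketch of that four-step flow in Spark. The Parquet path, keyspace, table and column names are hypothetical; in the SHAR3 pipeline the source data would come via ADAM rather than a plain Parquet read.

        import com.datastax.spark.connector._

        // 1. Source data: a columnar (Parquet) extract - path is hypothetical
        val variants = sqlContext.read.parquet("hdfs:///genomics/variants.parquet")

        // 2. Interactive analysis: explore and reduce the dataset in memory
        val perChromosome = variants.groupBy("chromosome").count()

        // 3. Persistent storage: keep the result in Cassandra for later queries
        perChromosome.rdd
          .map(r => (r.getString(0), r.getLong(1)))
          .saveToCassandra("research", "variant_counts", SomeColumns("chromosome", "variants"))

        // 4. Visualisation: the notebook (or a BI tool over the table) charts the result
        perChromosome.show()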
  27. Analytics: Static Data Pipeline Process • No longer an iterative process constrained by hardware limitations • Now a more scalable, resilient, dynamic, interactive process, easily shareable. The new model for large-scale static data analysis: Load → Analyse → Share
  28. Real-Time Datasets: If It’s Not “Now”, Then It’s Probably Already Too Late
  29. Big Data Pipelining: Real-Time Analytics. What problem are we trying to solve? • Capture, prepare and process fast streaming data • Needs a different approach from traditional batch processing • Has to be at the speed of now – cannot wait even for seconds • Immediate insight and instant decisions offer huge commercial and engineering advantages
  30. Big Data Analytics: Streams. Data streams? Data deluge? Data tidal waves! Netflix: • Ingests petabytes of data per day • Over 1 TRILLION transactions per day (>10 million per second) into DSE
  31. Big Data Pipelining: Real-Time Use Cases. Examples of use cases for streaming analytics… Social media (e.g. Twitter, Facebook): • Commercial value – trending products, sentiment analysis • Reaction time is critical as the value of data quickly diminishes over time. Sensor data (IoT), e.g. power plants, vehicles: • Critical safety and monitoring • Missed data could have significant safety implications • Utility billing, engineering management
  32. Big Data Pipelining: Real-Time Use Cases. Examples of use cases for streaming analytics… Transactional data: • Missed data could have huge financial implications, e.g. market data • Credit card transactions, fraud detection – if it’s not now, it’s too late. User experience: • Personalising the user experience • Commercial benefit to customising the user experience • Netflix, Spotify, eBay, mobile apps etc.
  33. Big Data Pipelining: Real-Time Architecture. Streaming analytics architecture – what do we need? Real-time analytics at scale demands fast processing with low latencies. The common solution is an in-memory distributed architecture, increasingly a technology stack comprising Kafka, Spark and Cassandra: • Scalable • Distributed • Resilient
  34. Intro To Kafka: Quick History. What is Apache Kafka? • Originally developed by LinkedIn • Open-sourced in 2011 • Top-level Apache project since 2012 • Enterprise support from Confluent • Fast? A single Kafka broker handles hundreds of MB/s of reads and writes from thousands of clients • Scalable? Can be elastically and transparently expanded without downtime; data streams are distributed over a cluster of machines • Durable? Messages are persisted on disk and replicated within the cluster to prevent data loss • Powerful? Each broker can handle TBs of messages without performance impact • Distributed? Modern cluster-centric design with strong durability and fault-tolerance
  35. Intro To Kafka: Architecture. How does Kafka work? Producers send messages to the Kafka cluster, which in turn serves them up to consumers • Kafka maintains feeds of messages in categories called topics • Processes that publish messages to Kafka are called producers • Processes that subscribe to topics and process the feed are called consumers • A Kafka cluster comprises one or more servers, each called a broker • Java API, with other languages supported
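     A minimal producer sketch using the Kafka Java client from Scala. The broker address, topic name and message content are hypothetical.

        import java.util.Properties
        import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

        // Minimal producer configuration - broker address is hypothetical.
        val props = new Properties()
        props.put("bootstrap.servers", "kafka-broker:9092")
        props.put("key.serializer",   "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

        val producer = new KafkaProducer[String, String](props)

        // Publish a message to the "sensor-readings" topic; consumers subscribed to the
        // topic read it from their own offset in the partition log.
        producer.send(new ProducerRecord[String, String]("sensor-readings", "sensor-42", "21.7"))
        producer.close()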
  36. Intro To Kafka: Streaming Flow. How does Kafka work with Spark? • Publish-subscribe messaging system implemented as a replicated commit log • Messages are simply byte arrays, so they can store any object in any format • Each topic partition is stored as a log (an ordered set of messages) • Each message in a partition is assigned a unique offset • Consumers are responsible for tracking their location in each topic log • Spark consumes messages as a stream, in micro-batches, saved as RDDs
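     A sketch of consuming a Kafka topic from Spark Streaming in micro-batches and persisting each batch to Cassandra via the connector. Broker, topic, keyspace and table names are hypothetical, and the API shown is the Spark 1.x direct-stream style.

        import kafka.serializer.StringDecoder
        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.kafka.KafkaUtils
        import com.datastax.spark.connector._
        import com.datastax.spark.connector.streaming._

        // Micro-batch interval of 5 seconds; each batch arrives as an RDD.
        val ssc = new StreamingContext(sc, Seconds(5))

        val kafkaParams = Map("metadata.broker.list" -> "kafka-broker:9092")
        val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
          ssc, kafkaParams, Set("sensor-readings"))

        // Parse each (key, value) message and persist the batch to Cassandra.
        stream
          .map { case (sensorId, value) => (sensorId, value.toDouble) }
          .saveToCassandra("sensors", "raw_readings", SomeColumns("sensor_id", "value"))

        ssc.start()
        ssc.awaitTermination()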
  37. DataStax Enterprise: Streaming Schematic. Sensor network → signal aggregation → messaging queue (sensor data, queue management, brokers) → collection → data processing & storage
  38. DataStax Enterprise: Streaming Analytics. Data processing & storage feeds both real-time analytics and near real-time / batch analytics (Analytics / BI), driving personalisation, actionable insight, monitoring and revenue (!$£€!)
  39. DataStax Enterprise: Multi-DC Uses. Real-time active-active geo-replication across physical datacentres (DC: USA and DC: EUROPE), plus workload separation via virtual datacentres – an OLTP ring (Cassandra) replicating to an analytics ring (Cassandra + Spark)
  40. Real-Time Analytics: DSE Multi-DC. Workload management and separation with DSE: a mixed-load OLTP and analytics platform with 100% uptime at global scale. App, web, social media and IoT traffic lands in the OLTP datacentre (personalisation & persistence), which replicates to the analytics datacentre for real-time analytics and Analytics / BI over JDBC/ODBC – separation of OLTP from analytics, driving personalisation, actionable insight, monitoring and revenue (!$£€!)
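     A small sketch of how replication across physical or virtual datacentres is expressed when creating a keyspace, here via the DataStax Java driver from Scala. The contact point, keyspace name, datacentre names and replication factors are all hypothetical.

        import com.datastax.driver.core.Cluster

        // Connect to any node in the cluster; the address is hypothetical.
        val cluster = Cluster.builder().addContactPoint("10.0.0.1").build()
        val session = cluster.connect()

        // NetworkTopologyStrategy places replicas per (possibly virtual) datacentre:
        // e.g. 3 replicas serving OLTP in "USA", 3 serving Spark analytics in "EUROPE".
        session.execute(
          """CREATE KEYSPACE IF NOT EXISTS sensors
            |WITH replication = {
            |  'class': 'NetworkTopologyStrategy',
            |  'USA': 3,
            |  'EUROPE': 3
            |}""".stripMargin)

        cluster.close()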
  41. Lambda & Big Data: DSE & Hadoop. OLTP feeds (web, app, social media, IoT, plus Oracle, IBM and SAP systems) enter a high-velocity ingestion layer and an OLTP layer (100% uptime, global scale); a real-time analytic/integration layer – scalable, fault-tolerant and fast – sits alongside active and legacy data stores (Hadoop) used for batch analytics, with Analytics / BI tools connecting over JDBC/ODBC
  42. Big Data Use Case: DSE & SAP. OLTP feeds (web, app, social media, IoT, plus Oracle, IBM and SAP systems) enter a high-velocity ingestion layer and an OLTP layer (100% uptime, global scale); a real-time analytic/integration layer – scalable, fault-tolerant and fast – provides hot data storage/query alongside active and legacy data stores, with SAP HANA Smart Data Access and Analytics / BI tools connecting over JDBC/ODBC
  43. Thank you!
