Beyond a Big Data Pilot: Building a Production Data Infrastructure - StampedeCon 2014

At StampedeCon 2014, Stephen O’Sullivan (Silicon Valley Data Science) presented "Beyond a Big Data Pilot: Building a Production Data Infrastructure."

Creating a data architecture involves many moving parts. By examining the data value chain, from ingestion through to analytics, we will explain how the various parts of the Hadoop and big data ecosystem fit together to support batch, interactive, and real-time analytical workloads.

By tracing the flow of data from source to output, we’ll explore the options and considerations for components, including data acquisition, ingestion, storage, data services, analytics and data management. Most importantly, we’ll leave you with a framework for understanding these options and making choices.

Transcript

  • 1. © 2014 Silicon Valley Data Science LLC All Rights Reserved. @SVDataScience Beyond a Big Data Pilot: Building a Production Data Infrastructure StampedeCon 29 May 2014, St. Louis Stephen O’Sullivan (@steveos) strata.svds.com @SVDataScience
  • 2. Stephen O’Sullivan, Distinguished Architect
  • 3. Beyond a Big Data Pilot: Building a Production Data Infrastructure. Creating a data architecture involves many moving parts. By examining the data value chain, from ingestion through to analytics, we will explain how the various parts of the Hadoop and big data ecosystem fit together to support batch, interactive, and real-time analytical workloads. By tracing the flow of data from source to output, we’ll explore the options and considerations for components, including data acquisition, ingestion, storage, data services, analytics and data management. Most importantly, we’ll leave you with a framework for understanding these options and making choices.
  • 4. Database categories: Key-Value, Columnar, Graph, Document, General
  • 5. UP OR OUT? Different use cases put different demands on the data infrastructure. Increasing cost per unit of capability from scale-up architectures causes rationing of resources, so only the most valuable use cases are pursued. [Chart: value vs. data resource usage for use cases UC1…UCn, plotted against the scale-out cost line]
  • 6. THE DATA VALUE CHAIN: Acquire → Ingest → Process → Persist → Integrate → Analyze → Expose
  • 7. BUILDING A DATA PLATFORM [Diagram: internal and external data sources feed Data Acquisition, then Data Ingestion, into a Data Repository (Persistence, with Offline, Real-Time, and Batch Processing), which is exposed through Data Services to Analytics and External Systems; Data Management (Security, Operations, Data Quality, Metadata Management, and Data Lineage) spans all layers]
  • 8. Acquisition: from internal and external data sources [platform diagram as before]
  • 9. Ingestion: offline and real-time processing [platform diagram as before]
  • 10. Persistence [platform diagram as before]
  • 11. Data Services: exposing data to applications [platform diagram as before]
  • 12. Analytics: batch and real-time processing [platform diagram as before]
  • 13. Data Management: data security, operations, lineage, quality, and metadata management [platform diagram as before]
  • 14. Use Case
    – Collect in-store sales transactions in near real-time.
    – Provide near real-time dashboards of sales transactions (rolled up by store, region, etc.).
    – Provide ad-hoc access to this data as soon as it’s collected (i.e., low latency and fine grain).
  • 15. FORTUNE 500 RETAIL COMPANY: Enabling Near Real-Time Sales Transactions [Diagram: application servers sending HTTP traffic to BI servers in Data Centers A and B]
  • 16. FORTUNE 500 RETAIL COMPANY: Enabling Near Real-Time Sales Transactions [Same diagram, with CFS added in each data center]
  • 17. Ready to go into Production?
    • Data Acquisition
      – Can you see the “collectors” (internal or external)? Make sure you have the correct network access in place.
      – Do you need to encrypt the data (internally or externally)? It will depend on the data and your policies.
    • Data Ingestion
      – Can you handle the traffic to the “collectors”? Make sure the solution you choose can scale out; Apache Flume is a good example of this.
      – Do you have redundant / self-healing paths into the cluster? Make sure you’re not point to point. In Flume, Storm, and Kafka you can configure forks, etc., but you may need to handle duplicate data.
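Handling duplicate data from at-least-once delivery paths can be sketched roughly as below. This is a minimal illustration, not anything from the talk: the event shape and the in-process `seen` set are assumptions; a production pipeline would keep the seen-key store in something shared and bounded (a cache with TTL, for instance).

```python
import hashlib
import json

def event_key(event: dict) -> str:
    """Derive a stable key for an event; assumes events are JSON-serializable dicts."""
    return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

def deduplicate(events, seen=None):
    """Drop events that have already been delivered on another path."""
    seen = set() if seen is None else seen
    unique = []
    for event in events:
        key = event_key(event)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique

# A batch with one duplicate, as a forked ingestion path might deliver it.
batch = [{"store": 1, "txn": 100}, {"store": 1, "txn": 100}, {"store": 2, "txn": 101}]
print(deduplicate(batch))  # the duplicated transaction appears only once
```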
  • 18. Ready to go into Production?
    • Data Repository
      – Can you handle out-of-order data? Make sure you have a way to address it, as it will happen.
      – Can you scale the cluster for data volume spikes and/or processing spikes? Hadoop and Cassandra make it very easy to add nodes. If you cannot add nodes, be prepared to drop data, stop processes, or both.
      – Should you just store plain text (compressed)? If it’s very wide data and you query a subset of the columns, Parquet would be a good choice. If you would like to be able to version your data schema, Avro is a good choice.
    • Data Services
      – Do applications need to access this data? Build a RESTful service to access it.
      – Do you have data resiliency? What is data resiliency, I hear you ask — see the next slide.
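The “build a RESTful service to access the data” advice can be sketched with Python’s standard library alone. The `/sales/<store>` endpoint and the in-memory dict are hypothetical stand-ins for a real repository; any real data service would sit in front of the actual persistence layer.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory stand-in for the data repository.
SALES = {"store-1": {"region": "midwest", "total": 12500.0}}

class DataServiceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expected path shape: GET /sales/store-1
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "sales" and parts[1] in SALES:
            body = json.dumps(SALES[parts[1]]).encode()
            self.send_response(200)
        else:
            body = b'{"error": "not found"}'
            self.send_response(404)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("localhost", 8080), DataServiceHandler).serve_forever()
```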
  • 19. DATA RESILIENCY
    – Stovepipe: a one-to-one relationship from data source to product. Hard failure: if the data source is broken, so is the app.
    – Multi-sourced: redundancy of overlapping data sources makes your products more resilient. Graceful degradation: if a data source breaks, there is a backup and your app continues to function.
    – Production data services abstract the probabilistic integration of overlapping data sources. We call this model a Data Mesh. [Diagram: products fed by data services over overlapping data sources, including a broken source]
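The graceful-degradation idea above can be sketched as a data service that tries overlapping sources in order. The source names and fetcher functions here are illustrative assumptions, not anything defined in the talk:

```python
def fetch_with_fallback(fetchers):
    """Try each overlapping data source in order; degrade gracefully instead of failing hard."""
    errors = []
    for name, fetch in fetchers:
        try:
            return fetch(), name
        except Exception as exc:  # a real service would catch narrower error types
            errors.append((name, exc))
    raise RuntimeError(f"all data sources failed: {errors}")

def primary():
    # Simulate the broken data source from the diagram.
    raise ConnectionError("primary feed is down")

def backup():
    return {"store-1": 12500.0}

data, source = fetch_with_fallback([("primary", primary), ("backup", backup)])
print(source)  # the app keeps working, served from the backup source
```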
  • 20. Ready to go into Production?
    • Analytics
      – Which is the right SQL-on-Hadoop solution (for me)? There are a few to choose from: Hive, Impala, Spark SQL, and HAWQ (and the list is growing). They share the same metastore; some are faster than others, depending on the type of query.
      – Which BI tool should I use? See if your current tool works with your distro. You can also look at Platfora, Datameer, and Karmasphere.
      – Do I still need to set up business views of the data? Yes, but the benefit is that you still have access to the raw data for advanced data analysts or data scientists.
      – What about deep analytics? Now that you have a data lake, you can do deep analytics on the data without moving it out.
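The “business views over raw data” point can be sketched in SQL. SQLite is used here purely as a runnable stand-in; in practice this would be a Hive or Impala view defined over tables in the shared metastore, and the table and column names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Raw, fine-grained transactions stay available for analysts and data scientists...
conn.execute("CREATE TABLE raw_sales (store TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("s1", "midwest", 10.0), ("s2", "midwest", 20.0), ("s3", "west", 5.0)],
)
# ...while BI users query a curated business view, rolled up by region.
conn.execute(
    """CREATE VIEW sales_by_region AS
       SELECT region, SUM(amount) AS total_sales
       FROM raw_sales GROUP BY region"""
)
for row in conn.execute("SELECT * FROM sales_by_region ORDER BY region"):
    print(row)
```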
  • 21. Analytics tools [Diagram: analytics tool landscape]
  • 22. Ready to go into Production?
    • Data Management
      – Security (who can see what?): it’s getting there. At the query level, using Hive or Impala, you can use Apache Sentry or Apache Knox. There are also third-party tools like Dataguise that let you do things like encryption at rest or masking.
      – Can you meet your SLA when other jobs/queries are running? Using the Fair Scheduler will help you manage your jobs’ SLAs; a third-party product by Pepper Data can help with this too (and a little more).
      – What monitoring do you have in place?
      – Cluster failover?
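The Fair Scheduler mentioned above is driven by an allocations file in YARN. A minimal sketch follows; the queue names, weights, and resource figures are illustrative assumptions, not values from the talk:

```xml
<?xml version="1.0"?>
<!-- fair-scheduler.xml: give the SLA-bound ETL queue a guaranteed floor
     and a larger share than ad-hoc queries when the cluster is contended. -->
<allocations>
  <queue name="etl">
    <weight>2.0</weight>
    <minResources>10000 mb,10 vcores</minResources>
  </queue>
  <queue name="adhoc">
    <weight>1.0</weight>
  </queue>
</allocations>
```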
  • 23. Adding more use cases
    – Am I duplicating data?
    – Can I reuse the infrastructure I’ve already created?
    – Do I have enough room in the cluster (space/processing)?
    – Will I impact the SLAs of jobs/queries currently running?
  • 24. HIGH LEVEL ARCHITECTURE — FORTUNE 500 RETAIL COMPANY: Enabling Real-Time Database Monitoring [Diagram: Oracle stats collection pulling data over JDBC, sending data to Graphite, and writing data to HDFS]
  • 25. FORTUNE 500 RETAIL COMPANY: Enabling Log Collection & Search [Diagram: application servers sending logs over HTTP and statsd to a log search service]
  • 26. Questions? Yes, we’re hiring: svds.com/join-us
  • 27. THANK YOU. Stephen O’Sullivan (@steveos)