Essential Data Engineering for Data Scientist


Here's a special Data Engineering Survival Checklist required to deploy a successful Data Science project.

Published in: Data & Analytics

  1. 1. Essential Data Engineering for Data Scientist
  2. 2. Me, myself, and I: Valentyn Kropov • Sr. Big Data Solutions Architect. • 14 years of work experience with Databases. • 4 years in Big Data. • Big Data Consulting Lead at SoftServe (20+ Engineers and Architects). • Founder of Kyiv Big Data Community (600+ people). webinar
  3. 3. Agenda 1. Level of Involvement 2. Choosing the Right Tools (Distribution of Hadoop) 3. RDBMS vs. NoSQL 4. NoSQL Data Modeling 5. Deployment 6. On-Premises vs. Cloud 7. Scalability and Performance 8. Storage
  4. 4. Level of Involvement
  5. 5. Who Should be Leading Data Science Projects?
  6. 6. Project Stages from Data Engineering Perspective 1. Statement of work 2. Requirements 3. Architecture 4. Infrastructure 5. Data modeling/ETL 6. Data Science modeling
  7. 7. Involvement: Checklist 1. You’re the boss! 2. You have a right to demand the infrastructure you need. 3. But you need solid arguments to back that demand. 4. And I’ll show them to you right now.
  8. 8. Choosing the Right Tools
  9. 9. Big Data Landscape 2016
  10. 10. Big Data Analytics Reference Architecture A modern-integrated approach for solving Big Data/Business Analytics needs across multiple verticals and domains. All Data Real-time Data Processing Data Acquisition and Storing DataIntegration Enterprise Data Warehousing Data Management (Governance, Security, Quality, MDM) Analytics Reporting and Analysis Predictive Modeling Data Mining Data Lake (Landing, Exploration and Archiving) UX and Visualization Applications Application data Media data: images, video, etc Social data Enterprise content data Machine, sensor, log data Docs and archives data Customer Analytics Marketing Analytics Web/Mobile/ Social Analytics IT Operational Analytics Fraud and Risk Analytics Complex Event Processing Real-time Query and Search
  11. 11. Hortonworks vs. Cloudera vs. MapR
      Feature                      Hortonworks      Cloudera          MapR
      File system                  HDFS             HDFS              MapR FS
      Non-Hadoop access            NFS              Fuse-DFS          Direct Access NFS
      Data integration services    Talend           -                 -
      Data analysis framework      -                DataFu            -
      Software abstraction layer   -                -                 Apache Cascading
      Web access                   WebHDFS          HTTPFS            -
      Parallel query execution     Tez (Stinger)    Impala            -
      Installation                 Ambari           Cloudera Manager  -
      Security                     -                Sentry            -
      Monitoring                   Ganglia/Nagios   -                 -
      Non-MapReduce tasks          YARN             YARN              -
  12. 12. Or Even More: IBM, Oracle, Amazon, … 1. IBM: Big R (set of Data Science algorithms) and Big SQL (SQL-like interface to data). 2. Oracle: Big Data appliance/connectors. 3. Amazon: Elastic MapReduce.
  13. 13. Choosing the Right Tools: Example (Description) Data Volume: • 270-300 Web Servers (Apache HTTPD) • 447 392 events per minute • 644 245 094 events / day • ~100-250 bytes per event • 150GB of data per day Log Types: • Apache HTTPD access log • Apache HTTPD error log • Service log (CPU, RAM, I/O, Disk) • Application server servlet log Retention: • Last 30 days: Raw data • Last 24 hours: per minute aggregation • Whole period: per hour aggregation
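The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is an illustration only: it assumes the upper bound of 250 bytes per event and binary units (GiB), and the `replication_factor` parameter is a hypothetical knob that the deck does not mention. Note that 447,392 events per minute gives 644,244,480 events per day, slightly below the slide's 644,245,094, most likely because the per-minute figure is rounded.

```python
GIB = 2**30  # bytes in one gibibyte
TIB = 2**40  # bytes in one tebibyte


def events_per_day(events_per_minute: int) -> int:
    """Scale a per-minute event rate to a full day."""
    return events_per_minute * 60 * 24


def daily_volume_bytes(events_per_minute: int, bytes_per_event: int) -> int:
    """Raw bytes ingested per day at a fixed event size."""
    return events_per_day(events_per_minute) * bytes_per_event


def raw_retention_bytes(daily_bytes: int, retention_days: int,
                        replication_factor: int = 1) -> int:
    """Bytes needed to keep raw data for the retention window.

    replication_factor is an assumption for illustration (not from the deck).
    """
    return daily_bytes * retention_days * replication_factor


# Slide figures: 447,392 events/minute, up to 250 bytes/event, 30 days of raw data.
daily = daily_volume_bytes(447_392, 250)
raw_30_days = raw_retention_bytes(daily, 30)
raw_30_days_replicated = raw_retention_bytes(daily, 30, replication_factor=3)

print(f"daily volume:         {daily / GIB:.1f} GiB")
print(f"30 days raw:          {raw_30_days / TIB:.2f} TiB")
print(f"30 days raw, 3 copies: {raw_30_days_replicated / TIB:.2f} TiB")
```

The aggregated tiers (per-minute and per-hour) are left out because the slide gives no record size for them.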
  14. 14. Choosing the Right Tools: Example (Marketecture)
  15. 15. Choosing the Right Tools: Example (Description - data)
      Access log:
        - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
      Error log:
        [Sun Mar 7 20:58:27 2004] [info] [client] (104)Connection reset by peer: client stopped connection before send body completed
        [Sun Mar 7 21:16:17 2004] [error] [client] File does not exist: /home/httpd/twiki/view/Main/WebHome
      vmstat:
        procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
        r b swpd free buff cache si so bi bo in cs us sy id wa st
        0 0 305416 260688 29160 2356920 2 2 4 1 0 0 6 1 92 2 0
      iostat:
        Linux 2.6.32-100.28.5.el6.x86_64 (dev-db) 07/09/2011
        avg-cpu: %user %nice %system %iowait %steal %idle
        5.68 0.00 0.52 2.03 0.00 91.76
  16. 16. Choosing the Right Tools: Example (Description - data)
  17. 17. Choosing the Right Tools: Example (Proof-of-Concept) 4200 events / second
  18. 18. Choosing the Right Tools: Example (Compression & speed) Charts: Compression Ratio, Access Speed
  19. 19. Choosing the Right Tools: Example (Accurate sizing)
  20. 20. Choosing the Right Tools: Checklist 1. Fastest random access to the data: Cloudera (Impala). 2. Universal (and fast!) access to data: MapR (MapR FS). 3. Data integration: Hortonworks (built-in Talend). 4. Never trust papers; always double-check with a proof of concept. 5. Lastly, size the system accurately and check every element of the chain!
  21. 21. RDBMS vs. NoSQL
  22. 22. RDBMS vs. NoSQL
  23. 23. It’s Not Necessarily Always Black and White! • Traditional-relational • Extended-relational • Non-relational • Lambda architecture (Hybrid) • Data refinery (Hybrid)
  24. 24. SoftServe Lambda Architecture Accelerator • Lambda architecture is a highly scalable and reliable data processing architecture based on Twitter’s successful experience in Big Data and analytics. • Supports the majority of use cases: real-time analytics, data discovery, and business reports. • SoftServe’s pre-built Lambda architecture stack shortens the customer’s time to market (TTM) by 15-20+ man-months.
  25. 25. RDBMS vs NoSQL: Checklist 1. RDBMS: Structured data, moderate velocity and volume (up to TB), with complex transactions. 2. NoSQL: Unstructured data, high velocity or volume (up to PB+), with simple transactions. 3. Hybrid, Lambda, Refinery: Something in-between.
  26. 26. NoSQL Data Modeling
  27. 27. NoSQL: How Is It Different from RDBMS? 1. Write operations are cheap. 2. Fewer transactions and weaker consistency. 3. Read operations are blazingly fast!
  28. 28. NoSQL: Two Main Rules to Remember 1. Spread data evenly around the cluster. 2. Minimize the number of partitions read.
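The first rule can be made concrete with a small simulation. This is a minimal sketch, not how any particular NoSQL engine assigns partitions: it assumes a simple "hash the key, take it modulo the node count" scheme, and the names `partition_for` and `spread` are hypothetical. It contrasts a high-cardinality key (one value per user) with a skewed low-cardinality key (a state code).

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_nodes: int) -> int:
    """Map a partition key to a node by hashing it (illustrative scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def spread(keys, num_nodes: int) -> Counter:
    """Count how many rows land on each node for the given keys."""
    return Counter(partition_for(key, num_nodes) for key in keys)


NUM_NODES = 4

# High-cardinality key: every row has its own key, so rows spread out.
user_keys = [f"user-{i}" for i in range(10_000)]

# Skewed key: 70% of rows share one value and therefore one node.
state_keys = ["CA"] * 7_000 + ["TX"] * 2_000 + ["NY"] * 1_000

print("by user id:", sorted(spread(user_keys, NUM_NODES).items()))
print("by state:  ", sorted(spread(state_keys, NUM_NODES).items()))
```

With the skewed key, at least 7,000 of the 10,000 rows always end up on a single node, which is the hot spot the first rule warns about.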
  29. 29. RDBMS: Queries Around Model Q1: People who live in state X. Q2: People who live in city Y. Q3: People who live at address Z.
  30. 30. NoSQL: Model Around Queries! Q1: People who live in state X. Q2: People who live in city Y. Q3: People who live at address Z.
      People_by_States: partition / primary key: state. Other columns: country, first_name, last_name, city, street_name1, street_name2, street_number.
      People_by_City: partition / primary key: city. Other columns: country, first_name, last_name, state, street_name1, street_name2, street_number.
      People_by_FullAddress: partition / primary key: country, city, state, street_name1. Other columns: first_name, last_name, street_name2, street_number.
  31. 31. Data Modeling: Checklist 1. In NoSQL, you can have a table for each query, and that is totally OK. Don’t try to save disk space: spend cheap writes to get the fastest reads. 2. There are (almost) no secondary indexes in NoSQL, only primary ones. 3. Pick the correct primary (partitioning) key so that each request reads only one partition.
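The "one table per query" idea can be sketched with plain dictionaries standing in for a NoSQL store. Everything here is a toy model under stated assumptions: `QueryTable` and `save_person` are hypothetical names, and each table simply groups rows by its partition key, so a read touches exactly one partition. The table names and key columns follow the people-by-state, people-by-city, and people-by-address example in the deck.

```python
from collections import defaultdict


class QueryTable:
    """Toy table: rows are grouped by a partition key built from key_columns."""

    def __init__(self, key_columns):
        self.key_columns = tuple(key_columns)
        self.partitions = defaultdict(list)

    def insert(self, row: dict) -> None:
        key = tuple(row[column] for column in self.key_columns)
        self.partitions[key].append(row)

    def read(self, *key):
        """Return the rows of the single partition addressed by key."""
        return list(self.partitions.get(tuple(key), []))


# One table per query, each keyed by what that query filters on.
people_by_state = QueryTable(["state"])
people_by_city = QueryTable(["city"])
people_by_full_address = QueryTable(["country", "city", "state", "street_name1"])
ALL_TABLES = (people_by_state, people_by_city, people_by_full_address)


def save_person(row: dict) -> None:
    """Write the same row to every query table (cheap writes, duplicated data)."""
    for table in ALL_TABLES:
        table.insert(row)


save_person({"first_name": "Ann", "last_name": "Lee", "country": "US",
             "state": "CA", "city": "San Jose", "street_name1": "Main St"})
save_person({"first_name": "Bob", "last_name": "Ray", "country": "US",
             "state": "TX", "city": "Austin", "street_name1": "Oak Ave"})

print(people_by_state.read("CA"))
print(people_by_city.read("Austin"))
print(people_by_full_address.read("US", "San Jose", "CA", "Main St"))
```

Each person is stored three times, which is the disk-space cost the checklist says to accept in exchange for single-partition reads.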
  32. 32. Deployment
  33. 33. Deployment Defined In short, deployment is the litmus test of a project’s maturity, and overall project success depends on it.
  34. 34. Deployment Stages 1. Bootstrapping: Create VMs and hosts. 2. Provisioning: Install software like Hadoop. 3. Configuration: Initial parameters and data. 4. Validation: Verify installation.
  35. 35. Deployment: Manual vs. Automation “Architectural Support for DevOps in a Neo-Metropolis BDaaS Platform” © Valentyn Kropov, Serge Haziyev, Rick Kazman, Hong-Mei Chen Time Savings of: 89.75%!
  36. 36. Deployment: Automation • Bootstrapping (Terraform): VM1 to VM5 on AWS / OpenStack / Google Cloud. • Provisioning, configuration, and verification (Ansible, Cloudera Director, Cloudera Manager, Ambari, Cloudbreak).
  37. 37. Deployment: Automation (Hadoop Cluster) 1. Bootstrapping: HashiCorp Terraform. 2. Provisioning & Configuration: Cloudera Director. 3. Validation: Cloudera Manager API.
  38. 38. Service Layout & Memory Allocation hadoop-clusters-like-a-boss/
  39. 39. Automation: Checklist 1. Deployment should be fully automated (Terraform and Ansible). 2. Ensure the service layout is correct (master nodes, worker nodes, and edge nodes). 3. Double-check that nodes have enough memory (~64-128GB for master/edge nodes, ~256-512GB for data/worker nodes).
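The memory guideline in this checklist is easy to turn into an automated validation step. The sketch below is one possible shape for such a check, not part of any tool named in the deck: the node inventory format and the `undersized_nodes` function are hypothetical, and only the lower bounds of the ranges on the slide are enforced.

```python
# Rule-of-thumb memory ranges from the checklist, in GB: (low, high).
RECOMMENDED_MEMORY_GB = {
    "master": (64, 128),
    "edge": (64, 128),
    "worker": (256, 512),
}


def undersized_nodes(nodes):
    """Return a message for every node below the low end of its role's range.

    nodes is a list of dicts with 'name', 'role', and 'memory_gb' keys
    (an assumed inventory format, for illustration only).
    """
    problems = []
    for node in nodes:
        role = node["role"]
        if role not in RECOMMENDED_MEMORY_GB:
            problems.append(f"{node['name']}: unknown role '{role}'")
            continue
        low, _high = RECOMMENDED_MEMORY_GB[role]
        if node["memory_gb"] < low:
            problems.append(
                f"{node['name']}: {node['memory_gb']}GB is below the "
                f"~{low}GB guideline for {role} nodes"
            )
    return problems


cluster = [
    {"name": "master-1", "role": "master", "memory_gb": 128},
    {"name": "edge-1", "role": "edge", "memory_gb": 32},
    {"name": "worker-1", "role": "worker", "memory_gb": 256},
    {"name": "worker-2", "role": "worker", "memory_gb": 128},
]

for problem in undersized_nodes(cluster):
    print(problem)
```

A check like this could run as part of the validation stage, so that an undersized node fails the deployment instead of surfacing later as a performance problem.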
  40. 40. On-Premises vs. Cloud
  41. 41. On-Premises (real hardware somewhere in your building or data center) 1. Highest data privacy (Regulations and sensitive data). 2. Quickest access to data (Latency). 3. Best velocity (Transfer rates). 4. Existing Hardware. 5. Control over resource usage.
  42. 42. Cloud (Amazon, Azure, etc.) 1. Efficient cost-reduction. 2. Universal access. 3. Flexibility. 4. Choice of applications. 5. Built-in maintenance and support. 6. Scalability!
  43. 43. Hybrid 1. Hybrid: a combination of on-premises and cloud. 2. On-premises: sensitive information and data for high-performance access. 3. Cloud: non-sensitive data.
  44. 44. On-Premises vs. Cloud 1. Oracle Exadata: ~$1,000,000. 2. For the same money, the biggest instance in Amazon EC2 (40 CPUs) can be rented for ~50 years!
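Read as a budget comparison, the two figures on this slide imply an hourly rental rate. The sketch below assumes that reading (about $1,000,000 spent either on the appliance or on roughly 50 years of continuous rental) and ignores price changes, discounts, storage, and data transfer; the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24


def implied_hourly_rate(budget_usd: float, years: float) -> float:
    """Hourly price at which the budget lasts for the given number of years."""
    return budget_usd / (years * HOURS_PER_YEAR)


def years_of_rental(budget_usd: float, hourly_rate_usd: float) -> float:
    """How long the budget lasts when an instance runs around the clock."""
    return budget_usd / (hourly_rate_usd * HOURS_PER_YEAR)


rate = implied_hourly_rate(1_000_000, 50)
print(f"implied rate: ${rate:.2f} per hour")
print(f"years at that rate: {years_of_rental(1_000_000, rate):.0f}")
```

The same two functions can be rerun with a current instance price to see how the comparison holds up for a specific workload.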
  45. 45. On-Premises vs. Cloud: Checklist 1. On-premises: If the customer has existing unused hardware, has predictable data volume growth, or has strong data security requirements. 2. Cloud: If the customer doesn’t have a large budget, is not sure about data and load growth, and doesn’t have strong security requirements or a team of engineers to support hardware. 3. Hybrid: A mixture of the requirements above.
  46. 46. Scalability & Performance
  47. 47. Dedicated Clusters A separate cluster of VMs for each service (Visualization Service, Data Ingestion Service, Analytics Service). • Configuration and management of 3 separate clusters. • Resources stay idle if a service is not active. • Need to move data between clusters for each service.
  48. 48. Shared Clusters Visualization Service, Data Ingestion Service, and Analytics Service move from multiple clusters to one shared cluster to: • maximize utilization • share data between services
  49. 49. Shared Clusters: Mesos/Docker VM1 VM2 VM3 VM4 VM5 on OpenStack / AWS / Google Cloud / Azure
  50. 50. Shared Clusters: Mesos/Docker Maximize utilization & performance: Deliver more services with smaller footprint. Shared clusters for all services: Easier deployment and management with unified service platform. Shared data between services: Faster and more competitive services and solutions.
  51. 51. How Does this Work? • ZooKeeper quorum with three Mesos Masters. • Service schedulers: Spark Service Scheduler and Marathon Service Scheduler. • Mesos Slave with a Spark Task Executor (Task #1, Task #2) and a Mesos Executor (./python XYZ). • Mesos Slave with two Docker Executors (java -jar XYZ.jar, ./xyz).
  52. 52. How Does this Work? Mesos provides fine-grained resource isolation. On a compute node, the Mesos Slave Process runs each executor in its own container (cgroups): the Spark Task Executor (Task #1, Task #2) and the Mesos Executor (./python XYZ).
  53. 53. How Does this Work? Mesos provides scalability. When the Python executor finishes, more resources become available, so the Spark Task Executor can run more tasks (Task #3 and Task #4 alongside Task #1 and Task #2).
  54. 54. How Does this Work? Mesos has no single point of failure: services keep running if a VM fails! (VM1 to VM5, three Mesos Masters.)
  55. 55. How Does this Work? The master node can fail over: services keep running if a Mesos Master fails! (VM1 to VM5, three Mesos Masters.)
  56. 56. How Does this Work? The slave process can fail over: tasks (Task #1 to Task #4 in the Spark Task Executor) keep running if the Mesos Slave Process fails!
  57. 57. Scalability & Performance: Checklist 1. If you need real scalability, use shared clusters. 2. Shared clusters are a natural fit for the cloud. 3. In most cases, scalability means performance, so you can treat the two as synonyms.
  58. 58. Storage
  59. 59. Netflix Storage: Situation 1. ~25PB Data Warehouse on Amazon S3. 2. Read ~10% daily. 3. Write ~10% daily. 4. ~550 billion events daily. 5. ~350 active platform users (> 80% – Data Science engineers).
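To get a feel for what these daily figures mean as sustained load, they can be converted to per-second averages. This is a back-of-the-envelope sketch only: it assumes decimal units (1 PB = 10^15 bytes) and perfectly even load over 24 hours, which real traffic will not follow; the helper name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PB = 10**15  # decimal petabyte, an assumption about the slide's units
GB = 10**9


def per_second(daily_total: float) -> float:
    """Average per-second rate for a quantity accumulated over one day."""
    return daily_total / SECONDS_PER_DAY


warehouse_bytes = 25 * PB
daily_read_bytes = 0.10 * warehouse_bytes  # "read ~10% daily"
daily_events = 550e9                       # "~550 billion events daily"

read_gb_per_s = per_second(daily_read_bytes) / GB
events_per_s = per_second(daily_events)

print(f"average read rate:  {read_gb_per_s:.1f} GB/s")
print(f"average event rate: {events_per_s / 1e6:.2f} million events/s")
```

Peak rates will be higher than these averages, which is one reason the storage choice deserves a proof of concept.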
  60. 60. Netflix Storage: Architecture (2013)
  61. 61. Netflix Storage: Architecture (2014)
  62. 62. Netflix Storage: Architecture (2015) platform?qid=a9bda293-24df-4f6f-a06a-5b02eb751b35&v=&b=&from_search=1
  63. 63. Storage Comparison 1. Amazon S3: universal access, cheap, and data needs to be copied before processing. 2. HDFS: compatible with Hadoop ecosystem, relatively cheap, and data can be processed where it is being stored. 3. Directly Attached Storage/Network Attached Storage: expensive, fastest access to data, and it also can be processed where data is being stored.
  64. 64. Storage: Checklist 1. If you need unified access to data, use a universal cloud file system such as Amazon S3. 2. For immediate access to data (OLTP systems), you need Directly Attached Storage (DAS), Network Attached Storage (NAS), Amazon Elastic Block Store (EBS), and so on. 3. If you choose NoSQL, you’ll need much more space than the actual data takes (each query might require its own duplicate copy of the data). 4. Pick storage carefully and use a PoC/prototype; otherwise, changing storage later will be hard or almost impossible.
  65. 65. Final Checklist
  66. 66. Final Checklist 1. You’re the boss! 2. You have a right to demand the infrastructure you need. 3. However, you need solid arguments to back that demand. 4. Now you have them and know where to get the details. 5. Good luck and see you in the field!
  67. 67. Contacts
  68. 68. • USA HQ Toll Free: 866-687-3588, Tel: +1-512-516-8880 • Ukraine HQ Tel: +380-32-240-9090 • Bulgaria Tel: +359-2-902-3760 • Germany Tel: +49-69-2602-5857 • Netherlands Tel: +31-20-262-33-23 • Poland Tel: +48-71-382-2800 • UK Tel: +44-207-544-8414 Thank you!