THE INS & OUTS OF DATA TRANSFER
LOS ANGELES AWS USERS GROUP
JASON DAVIS, CEO SIMON DATA
@JASONDAVIS
DRJASONDAVIS.COM

The ins & outs of data transfer

This presentation gives an overview of the core use cases and business requirements that drive standard ETL processes. It then addresses common failure cases and provides high-level development and architectural guidance.

Published in: Data & Analytics

The ins & outs of data transfer

  1. THE INS & OUTS OF DATA TRANSFER
     LOS ANGELES AWS USERS GROUP
     JASON DAVIS, CEO SIMON DATA
     @JASONDAVIS
     DRJASONDAVIS.COM
  2. A TYPICAL DATA ECOSYSTEM
     (Diagram labels)
     • OLTP/RDS
     • DATA LAKE / REDSHIFT / S3
     • USERS FRONTEND
     • ANALYTICS
     • BACK OFFICE
     • "THE BIZ"
     • CORE TECH
     • 3P TECH / SAAS
     • CRM / ERP
     • EMAIL / PUSH / SMS
     • GRAPHS / BI
     • APPLICATION
  3. OVERVIEW
     • A gentle introduction to data transfer & "ETL"
     • An overview of common failure cases
     • Best practices and some high-level guidance
  4. SOME TYPICAL DATA TRANSFERS
     • WEB ANALYTICS
     • "BUSINESS" REPORTING
     • ACQUISITION / LTV ANALYSIS
     • EMAIL SEGMENTATION
  5. SOME MORE TYPICAL DATA TRANSFERS
     • Product recommendations
       Extract: skus, purchase / browse history, profit margins
       Transform: Deep learning / recommender systems
       Load: user / sku recommendations into a production DB
     • Inventory planning
       Extract: historical sales, inventory and shipping costs
       Transform: Stockage goal estimation
       Load: Sku-level forecasts into an ERP system
     • Executive dashboard
       Extract: revenue, traffic, support volume, operational data
       Transform: basic aggregates
       Load: pie charts, vanity metrics driven by a reporting DB
  6. DATA TRANSFER IN 3 STEPS: EXTRACT-TRANSFORM-LOAD
     ETL: the process of pulling data from one or more sources for use in another
     • Extract data from one or more sources
       Database, event streams, S3, Salesforce, email metrics
     • Transform data via aggregations, joins, filters, and/or predictive analysis
       Parallel (Hadoop, Spark), In-core (Redshift), Scripts (Python, bash)
     • Load data into destination
       Database / Redshift, S3, HDFS, SaaS, ERP, CRM, email platform, etc.
  7. MOVING DATA IS HARD: COMMON FAILURE CASES
     • Extraction failures
       Source unavailable
       Data corrupt / incomplete - upstream error
     • Transform failures
       Resources unavailable / exceeded: out of memory (OOM)
       Broken computation: bad math / division by zero (DBZ)
     • Load failures
       Validation errors
       Connectivity errors
       Availability / bandwidth limitations
     • Failures can cascade in unexpected ways
  8. FUNDAMENTAL CHALLENGES
     • Maintaining state between two systems is hard
       The basic problem of 1-1 syncing is hard in itself
       Incremental and cursor-based extractors are all prone to failure
     • Failure cases are wide, varied, and data-driven
       They generally require running in a real-world context for an extended period
       Many times failures are silent
     • Ensuring correctness is hard / impossible
       Run-times are generally longer, which strains unit testing best practices
  9. DATA PIPES ARE HARD TO UNIT TEST
     WRITE UNIT TESTS BUT TEMPER EXPECTATIONS
     • Break your pipeline into small steps
     • Large SQL statements are hard to test
     • SQL in general is hard to unit test - it's a declarative language after all
     • Data flow languages such as Spark / Cascading are easier to test
     • Build patterns to be able to easily test real-world inputs against outputs
     • Timeout errors and other exceptional cases are hard to unit test in isolation
  10. IDEMPOTENCY
      "THINGS DON'T ALWAYS TAKE ON THE FIRST TRY...."
      • Idempotent. A unary operation (or function) is idempotent if, whenever it is applied twice to any value, it gives the same result as if it were applied once; i.e., ƒ(ƒ(x)) ≡ ƒ(x). For example, the absolute value function, where abs(abs(x)) ≡ abs(x), is idempotent.
      • In layman's terms: your code has the same result whether you run it once, twice, or three or more times.
      • Why is this important? Oftentimes you won't know if something was successful or not.
      • Solution: Idempotency allows you to "just run it again"
  11. VISIBILITY: LOGGING, GRAPHING, & ALERTING
      OPTIMIZE FOR TIME TO DETECTION
      • Start with fine-grained logs
      • "Measure Anything, Measure Everything" - Etsy, Code as Craft
      • Alert on things that are mission critical or have well-known failure characteristics
  12. THANKS
      QUESTIONS?
      DRJASONDAVIS.COM
      EMAIL ME: JASON@SIMONDATA.COM
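
Slides 6 and 9 describe ETL as three steps and recommend breaking a pipeline into small steps so that real-world inputs can be tested against outputs. A minimal Python sketch of that idea follows. The function names, the `user_id,sku,amount` line format, and the in-memory dict used as a destination are all assumptions made for illustration; none of them come from the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Row = Dict[str, object]


def extract(raw_lines: Iterable[str]) -> List[Row]:
    """Parse hypothetical 'user_id,sku,amount' lines into rows, skipping blanks."""
    rows: List[Row] = []
    for line in raw_lines:
        line = line.strip()
        if not line:
            continue
        user_id, sku, amount = line.split(",")
        rows.append({"user_id": user_id, "sku": sku, "amount": float(amount)})
    return rows


def transform(rows: Iterable[Row]) -> Dict[str, float]:
    """Compute a basic aggregate: total purchase amount per user."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[str(row["user_id"])] += float(row["amount"])
    return dict(totals)


def load(totals: Dict[str, float], destination: Dict[str, float]) -> None:
    """Write totals into the destination (a dict standing in for a reporting DB)."""
    destination.update(totals)


def run_pipeline(raw_lines: Iterable[str], destination: Dict[str, float]) -> None:
    """Compose the three small steps; each one can also be tested on its own."""
    load(transform(extract(raw_lines)), destination)
```

Because each step is a plain function over in-memory data, a captured sample of production input can be fed to `extract` or `transform` directly in a unit test, without a live source or destination. As slide 9 notes, this does not cover timeouts or other failures that only appear against real systems.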

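Slide 10 argues that an idempotent load lets you "just run it again" when you cannot tell whether a run succeeded. One common way to get that property, sketched here with Python's standard `sqlite3` module, is to key each row on a primary key and upsert rather than append. The `sku_forecast` table and its columns are hypothetical, and SQLite stands in for whatever destination a real pipeline would use.

```python
import sqlite3
from typing import Dict, List, Tuple


def create_table(conn: sqlite3.Connection) -> None:
    """Create the hypothetical destination table, keyed by sku."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sku_forecast ("
        "sku TEXT PRIMARY KEY, forecast REAL NOT NULL)"
    )


def load_forecasts(conn: sqlite3.Connection, forecasts: Dict[str, float]) -> None:
    """Upsert forecasts; re-running with the same input leaves the same rows."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.executemany(
            "INSERT OR REPLACE INTO sku_forecast (sku, forecast) VALUES (?, ?)",
            list(forecasts.items()),
        )


def snapshot(conn: sqlite3.Connection) -> List[Tuple[str, float]]:
    """Return the table contents in a stable order for comparison."""
    return conn.execute(
        "SELECT sku, forecast FROM sku_forecast ORDER BY sku"
    ).fetchall()
```

Calling `load_forecasts` twice with the same dictionary leaves `snapshot` unchanged, which is the ƒ(ƒ(x)) ≡ ƒ(x) property from the slide. A plain `INSERT` into a table without a primary key would instead add duplicate rows on every retry.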
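
Slide 11 recommends starting with fine-grained logs and alerting on mission-critical steps to reduce time to detection. The sketch below shows one possible shape for that using only the Python standard library: a context manager that logs each step's status and duration, plus a check that lists the steps worth alerting on. The names `timed_step` and `should_alert`, the metrics dictionary, and the duration threshold are assumptions for illustration, not something the slides prescribe.

```python
import logging
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

logger = logging.getLogger("pipeline")

Metrics = Dict[str, Dict[str, object]]


@contextmanager
def timed_step(name: str, metrics: Metrics) -> Iterator[None]:
    """Log the status and duration of one pipeline step and record them in metrics."""
    start = time.monotonic()
    try:
        yield
    except Exception:
        elapsed = time.monotonic() - start
        logger.exception("step=%s status=failed seconds=%.3f", name, elapsed)
        metrics[name] = {"status": "failed", "seconds": elapsed}
        raise  # never swallow the failure; silent failures are the hard ones
    else:
        elapsed = time.monotonic() - start
        logger.info("step=%s status=ok seconds=%.3f", name, elapsed)
        metrics[name] = {"status": "ok", "seconds": elapsed}


def should_alert(metrics: Metrics, max_seconds: float) -> List[str]:
    """Return the names of steps that failed or ran longer than max_seconds."""
    return [
        name
        for name, m in metrics.items()
        if m["status"] != "ok" or float(m["seconds"]) > max_seconds
    ]
```

A step is wrapped as `with timed_step("extract", metrics): ...`, and the result of `should_alert` can be passed to whatever paging or notification system is already in place.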