Hadoop and Hive at Square
Nicolas Thiébaud
nicothieb@
nicolas@squareup.com
Data Engineering at Square
July 2014
Published in: Technology
Transcript of "Hive and Hadoop Usage at Square"

  1. Hadoop and Hive at Square
       • Nicolas Thiébaud
       • nicothieb@
       • nicolas@squareup.com
       • Data Engineering at Square
       • July 2014
  2. Square: Make commerce easy
       • Remove crappy POSes from the counter
           • Building the best register for small businesses.
           • Started with card processing and bringing more value to merchants using the point of sale.
       • Merchant- and buyer-facing products
           • Square Register, Square Cash, Pickup, Feedback
       • Data products
           • Merchant Analytics, Capital
  3. Data at Square
       • Internal data
           • Produced on app servers (~200+ services), MySQL or PostgreSQL
           • Logging and tracing from apps and web to a public endpoint
           • Examples: payment data, user data, ledger entries
       • External data
           • Payment processing partners ship flat files to us
       • Offline data usage at Square
           • BI/analysis/reporting: ~200 MySQL users, ~100 Hadoop users
           • ML: risk detection, recommendation
           • Apps: A/B testing, commercial support, Capital
  4. Data Architecture at Square: Kafka
       • Historical, most of our users still use this
           • App DB -> Analytical DB, stripping out PII, cursoring, looking at binlog replication
       • Hadoop: Kafka as a backbone
           • App DB -> Kafka using cursoring and PII stripping
           • App Server -> Kafka (e.g. tracing) in proto format
           • Feed consumption -> Kafka
       • Kafka is written to HDFS using offsets; dupes are written when the consumer restarts
       • Raw data is deduped and extracted from protos to RCFiles in daily batches. Everything is exposed in Hive
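The slide above says duplicates land in HDFS when the consumer restarts and are removed in a daily batch, but it does not say how. A minimal sketch of one plausible approach, assuming each raw record still carries its Kafka coordinates (topic, partition, offset); the `Record` type and `dedupe` function are hypothetical names for illustration, not Square's code:

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    """Hypothetical raw record that keeps its Kafka coordinates."""
    topic: str
    partition: int
    offset: int
    payload: bytes


def dedupe(records: Iterable[Record]) -> Iterator[Record]:
    """Yield the first copy of each (topic, partition, offset) and drop replays."""
    seen = set()
    for record in records:
        key = (record.topic, record.partition, record.offset)
        if key in seen:
            continue  # written again after a consumer restart
        seen.add(key)
        yield record
```

Keying on the offset rather than on the payload means two distinct events with identical contents are both kept; only records replayed from the same position in the log are dropped.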
  5. Transitioning to Hive
       • Most datasets don't fit in MySQL. Most queries can no longer run
           • Analysts broke down their jobs to run on single-day windows. The query sniper keeps hitting them.
       • MySQL is no longer supported as the source of truth for offline data
           • Tables are windowed
           • We keep revisiting the amount of data stored in MySQL
       • Everyone must migrate to Hive (users and apps)
           • MySQL analytical DBs will now be an export location for data reduced in Hadoop
       • All datasets must be present in Hadoop
           • Even small ones :)
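The "single-day windows" workaround mentioned above amounts to splitting a date range into one-day slices and running the query once per slice. A minimal sketch of that splitting; `daily_windows` is a hypothetical helper, and how each window is turned into a query is left out because the talk does not describe it:

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def daily_windows(start: date, end: date) -> Iterator[Tuple[date, date]]:
    """Yield half-open (day_start, day_end) windows covering [start, end)."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        yield day, next_day
        day = next_day
```

Half-open windows avoid counting a row that falls exactly on midnight in two consecutive runs.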
  6. Transitioning to Hive
       • Stability
           • Hive 10 + Hue 2.5 as a starting point + many patches -> 2 restarts a day with small load
           • Decided to go to Hive 12 and patch the bugs affecting us in an internal build
           • Two major tasks: 10 -> 12 and building Hive internally
       • Reliability
           • Sentinel, data validation daemon
           • Conduit, Hive ETLs
           • Customer-defined SLAs
       • Education
           • Office hours, trainings, mailing list
  7. Project Babar: Building a stable Hive 12
  8. Project Babar: Building a stable Hive 12
       • Patch open source Hive to address Square-specific issues
           • Set up integration tests in Kochiku, no performance tests
           • HiveServer only, no CLI. Staging and production envs
           • Push and pull changes to Apache JIRA
       • Build and deploy Hive artifacts
           • Makefile
           • metastore, hiveserver (staging and prod), CLI tools (beeline), hivesandbox
           • package configuration
       • Misc
           • Hue 3.5
           • hive-udfs
  9. Internal Hive Build
       • cdh5-0.12.0_5.0.1 branch + 9 commits
       • 3 test fixes, 2 Square-specific changes (pom + ci)
       • DATAPLAT-436: Beeline should return non-zero on invalid statements
       • HIVE-5799: session/operation timeout for hiveserver2
       • HIVE-5707: Validate values for ConfVar
       • HIVE-7040: Allow TCP keep alive on Hive Server 2
       • (merged in cdh5-0.12.0_5.0.1) HIVE-6893: out of sequence error in HiveMetastore
  10. Story of HIVE-7040 + HIVE-5799
       • HIVE-7040: Allow TCP keep alive on Hive Server 2
           • F5 stateful firewall kills open connections
       • HIVE-5799: session/operation timeout for hiveserver2
           • Beeline interrupt does not close sessions
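TCP keep-alive helps with a stateful firewall because the operating system sends periodic probes on an otherwise idle connection, so the firewall keeps seeing traffic. The sketch below only illustrates the general socket option in Python; it is not the HIVE-7040 patch, which changes HiveServer2 itself, and `open_keepalive_socket` is a hypothetical helper name:

```python
import socket


def open_keepalive_socket() -> socket.socket:
    """Create a TCP socket with SO_KEEPALIVE switched on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Ask the OS to probe the peer when the connection sits idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock
```

How long a connection must be idle before the first probe goes out is controlled by operating-system settings. If that delay is longer than the firewall's idle timeout, enabling the option alone may not be enough.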
  11. Hive Ops trick:
      ./wait_for_hive_jobs && sudo sv restart /var/service/hiveserver
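In the one-liner above, `&&` means the restart runs only if the wait step exits successfully. The contents of `wait_for_hive_jobs` are not shown in the talk, so the following is only a guess at the general shape: poll until no jobs are running, or give up after a timeout. `wait_until_idle` and the `count_running_jobs` callable are hypothetical.

```python
import time
from typing import Callable


def wait_until_idle(
    count_running_jobs: Callable[[], int],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once no jobs are running, False if the timeout passes first."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if count_running_jobs() == 0:
            return True
        sleep(poll_seconds)
    return False
```

A caller would restart the server only when this returns True, mirroring the `&&` in the shell command.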
  12. Next Steps
       • Figure out the best way to contribute back patches
           • HIVE-668{3,4}: Beeline comments suck
           • HIVE-7200: Beeline output displays column heading even if --showHeader=false is set
           • HIVE-4924: Support JDBC query timeouts
           • HIVE-5232: Use async interface for jdbc
       • Hive HA
       • Shark
       • Tez?