Hadoop at Ayasdi
Mohit Jaggi and Huang Xia, Zhen Li, Ajith Warrier
Overview
- HDFS for storage
- YARN for integration into Hadoop data lake
- Parquet as the file format
- bigdf based on Spark for feature engineering, data wrangling
Architecture
[architecture diagram; components: UI/API handler, Algorithms, bigdf, Resource Broker, Apache Spark, YARN, Hadoop Data Lake]
!! Audience Poll !!
1. How many data scientists?
2. How many backend engineers?
3. UI/frontend engineers?
4. Using Hadoop in production?
5. Using Spark in production?
6. Personally worked on data bigger than 100GB? 1TB? 10TB? 1PB?
HDFS
HDFS - Motivation
- installed base, large community
- ecosystem to connect to most other data sources
- commodity cluster
- experiments with distributed NAS didn't show enough promise to justify the additional cost and complexity
HDFS - Usage
- used as distributed storage
- jobs dispatched to least loaded node
- run Spark jobs (see the sketch below)
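
A minimal sketch of this usage, assuming a Spark 1.x deployment; the application name and HDFS path are illustrative, not Ayasdi's actual ones:

import org.apache.spark.{SparkConf, SparkContext}

object HdfsUsageSketch {
  def main(args: Array[String]): Unit = {
    // Master URL comes from spark-submit; the app name is illustrative.
    val sc = new SparkContext(new SparkConf().setAppName("hdfs-usage-sketch"))

    // Spark reads its input directly from distributed storage (HDFS);
    // the scheduler picks which node actually runs each task.
    val rows = sc.textFile("hdfs:///data/events/part-*")
    println(s"row count: ${rows.count()}")

    sc.stop()
  }
}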
Fishing for insights...
[image: big data]
YARN - Motivation
- Ayasdi scheduler
  - maximize throughput for batch jobs
  - minimize latency for interactive “tasklets”
- wanted to deploy in existing Hadoop data lakes
- integrated in-house scheduler with YARN (see the sketch below)
  - “tasklets” get a long-running container
  - batch jobs get a container on demand
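
Not Ayasdi's actual scheduler code, but a minimal sketch of the two container patterns above, using the standard YARN AMRMClient API; the memory sizes, priorities, and the isTasklet flag are illustrative assumptions:

import org.apache.hadoop.yarn.api.records.{Priority, Resource}
import org.apache.hadoop.yarn.client.api.AMRMClient
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest
import org.apache.hadoop.yarn.conf.YarnConfiguration

object YarnIntegrationSketch {
  // Ask the ResourceManager for one container; sizes are illustrative.
  def request(amrm: AMRMClient[ContainerRequest], isTasklet: Boolean): Unit = {
    val capability = Resource.newInstance(if (isTasklet) 4096 else 8192, 2)
    val priority   = Priority.newInstance(if (isTasklet) 0 else 1)
    // null node/rack constraints let YARN place the container anywhere.
    amrm.addContainerRequest(new ContainerRequest(capability, null, null, priority))
    // A "tasklet" container would be kept alive and reused across
    // interactive requests; a batch container is released when its job ends.
  }

  def main(args: Array[String]): Unit = {
    val amrm = AMRMClient.createAMRMClient[ContainerRequest]()
    amrm.init(new YarnConfiguration())
    amrm.start()
    amrm.registerApplicationMaster("", 0, "") // host, RPC port, tracking URL
    request(amrm, isTasklet = true)  // long-running interactive container
    request(amrm, isTasklet = false) // on-demand batch container
  }
}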
YARN - Challenges
- increased latency observable for small batch jobs
- early adopter pains
- sparse documentation
- not the best API design
[image: big data store, compressed data]
Parquet - Motivation
- legacy: data stored in both row-major and column-major order
  - requires an expensive transpose on ingestion
- were designing a “tiled file format” when we discovered Parquet (see the sketch below)
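
For contrast with the legacy dual-copy layout, a minimal sketch of keeping a single columnar Parquet copy via Spark SQL; assumes Spark 1.4+, and the schema and path are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("parquet-sketch"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Illustrative data; the legacy format would materialize this twice,
    // once row major and once column major.
    val df = sc.parallelize(Seq((1, "a", 3.0), (2, "b", 4.5)))
      .toDF("id", "label", "score")

    // Parquet stores each column contiguously (and compressed), so column
    // scans are fast without a second, transposed copy of the data.
    df.write.parquet("hdfs:///data/events.parquet")
  }
}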
Parquet - Challenges
- early adopter challenges
- sparse documentation
- needed to access package-private APIs
bigdf - Motivation
- born out of experience using Spark for feature engineering
- classes created for RDDs were not reusable across projects
- SQL not expressive enough
bigdf - details
- open source since Sep 2014
- precedes Spark DataFrame, so built on the spark-core engine
- experimenting with Catalyst via the Spark DataFrame APIs; looks promising
- Python and Scala APIs
- feature engineering library [not open source :-( ]
- fast CSV reader (and other features) contributed to spark-csv (see the example below)
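
For reference, a minimal spark-csv usage sketch; assumes Spark 1.4+ with the spark-csv package on the classpath, and the path and options are illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object CsvSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("csv-sketch"))
    val sqlContext = new SQLContext(sc)

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")      // first line holds column names
      .option("inferSchema", "true") // sample the data to guess column types
      .load("hdfs:///data/input.csv")

    df.printSchema()
  }
}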
bigdf - future
- wrapper around Spark DataFrame (see the sketch below)
  - to protect from API changes
  - to add features, e.g. “sparse column set”, as the “round-trip time” for pull requests into large open source projects is high
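
A minimal sketch of the wrapper idea; the class and method names here are hypothetical, not bigdf's actual API:

import org.apache.spark.sql.{Column, DataFrame}

// Hypothetical facade: application code depends on this class, so a Spark
// DataFrame API change is absorbed in one place instead of everywhere.
class DF(private val underlying: DataFrame) {
  def col(name: String): Column = underlying(name)

  def where(cond: Column): DF = new DF(underlying.filter(cond))

  // A feature added in the wrapper rather than upstream: select every
  // column whose name matches a pattern, a "sparse column set" style
  // convenience that would be slow to land in Spark itself.
  def selectMatching(pattern: String): DF =
    new DF(underlying.select(
      underlying.columns.filter(_.matches(pattern)).map(name => underlying(name)): _*))

  def toSpark: DataFrame = underlying // escape hatch to the real DataFrame
}

Code written against the facade keeps compiling across Spark upgrades, and conveniences like selectMatching can ship without waiting on an upstream pull request.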
Thanks!
www.ayasdi.com
http://engineering.ayasdi.com
https://github.com/AyasdiOpenSource/bigdf
http://www.ayasdi.com/company/careers/
