DAVID GEVORKYAN
@davidgev
davidgevorkyan
GRADUATED AUA IN 2008
WHAT IS BIG DATA?
FASHIONABLE TERM?
80% OF DATA EXISTING IN ANY ENTERPRISE IS UNSTRUCTURED DATA
STRUCTURED DATA
SEMI-STRUCTURED
UNSTRUCTURED DATA
RDBMS Data Warehousing
90% OF THE DATA IN THE WORLD TODAY HAS BEEN CREATED IN THE LAST TWO YEARS ALONE
Source: http://www.intel.com/content/www/us/en/communications/internet-minute-infographic.html
4 V’S OF BIG DATA
VOLUME (large amount of data)
VARIETY (sensors, video, audio, email, social)
VELOCITY (speed of data generation)
VERACITY (authenticity and/or accuracy)
SOLUTIONS REQUIRED
forces you to change the way you
• COLLECT
• TRANSPORT
• STORE
• MANAGE
• ANALYZE
• VISUALIZE
WHAT IS DATA SCIENCE?
DATA SCIENCE != STATISTICAL ANALYSIS
IT IS SCIENCE AND “ART” OF…
• EXPLORING THE UNKNOWN ABOUT DATA
“make discoveries while swimming in the data”
• REFINING THE RESULTS FOR ACCURACY
• DERIVING ACTIONABLE INSIGHT
• CREATING DATA-DRIVEN PRODUCTS
WHO ARE DATA SCIENTISTS?
Drew Conway, 2010
BIG DATA SCIENCE TOOLS?
• Scala, Java, Python, R… (bonus: Clojure, Haskell, Erlang)
• Hadoop, HDFS, MapReduce… (bonus: Spark, Storm, Tez)
• Scalding, HBase, Pig, Hive… (bonus: Shark, Titan, Giraph)
• Flume, Sqoop, ETL, Web scrapers… (bonus: Hume)
• SQL, RDBMS, DW, OLAP… (bonus: Solr, Elasticsearch)
• KNIME, Weka, RapidMiner… (bonus: SciPy, NumPy, Pandas)
• D3.js, Kibana, ggplot2, Tableau… (bonus: Shiny, Flare, Datameer)
• SPSS, Matlab, SAS… (the enterprise man)
• NoSQL, MongoDB, Cassandra, CouchDB
• And Yes!… MS-Excel: the most used, most underrated DS tool
GOAL?
• Revenue, revenue, revenue
• Improve the customer experience
• Increase operational efficiency
• GE: Optimize maintenance intervals for industrial products
• Google: Refine search and ad-serving algorithms
• Zynga: Optimize the game experience for both long-term engagement and revenue
• Netflix: Movie recommendations
• Kaplan: Uncover effective learning strategies
• eHarmony: Create happy relationships
WHO ARE WE?
TRADITIONAL METHODS DO NOT WORK ANYMORE…
EHARMONY CREATES THE HAPPIEST, MOST PASSIONATE AND MOST FULFILLING RELATIONSHIPS*
*ACCORDING TO A RECENT STUDY
438 MARRIAGES PER DAY
THE DIFFERENCE?
Compatibility Matching System®
COMPATIBILITY MATCHING
AFFINITY MATCHING
MATCH DISTRIBUTION
UNIDIRECTIONAL USER DEFINED CRITERIA
BIDIRECTIONAL
Leo
Ian
Steve
Nicolette
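The unidirectional versus bidirectional distinction can be sketched in a few lines. This is only an illustration: the `Member` class, the age-range criterion, and every age below are hypothetical values invented for the example, not eHarmony's actual criteria or data. A unidirectional check asks whether a candidate satisfies one member's criteria; a bidirectional match requires the check to pass in both directions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    name: str
    age: int
    min_age: int  # youngest partner this member accepts (hypothetical criterion)
    max_age: int  # oldest partner this member accepts (hypothetical criterion)


def accepts(chooser, candidate):
    """Unidirectional check: does `candidate` satisfy `chooser`'s criteria?"""
    return chooser.min_age <= candidate.age <= chooser.max_age


def is_bidirectional_match(a, b):
    """Bidirectional check: each member must satisfy the other's criteria."""
    return accepts(a, b) and accepts(b, a)


# Illustrative values only; the ages and ranges are made up.
nicolette = Member("Nicolette", age=30, min_age=28, max_age=36)
candidates = [
    Member("Leo", age=34, min_age=25, max_age=32),
    Member("Ian", age=29, min_age=31, max_age=40),
    Member("Steve", age=41, min_age=27, max_age=35),
]

one_way = [m.name for m in candidates if accepts(nicolette, m)]
two_way = [m.name for m in candidates if is_bidirectional_match(nicolette, m)]
print(one_way, two_way)
```

With these made-up values, two candidates pass the one-way filter but only one survives the two-way check, which is the narrowing effect of bidirectional criteria.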
150 questions
Personality
Values
Attributes
Beliefs
Intellect
Energy
Sociability
Ambition
Kindness
Curiosity
Humor
Spirituality
COMPATIBILITY MATCHING
USER DEFINED CRITERIA
COMPATIBILITY MODELS
MONGODB
VOLDEMORT
MONGODB
DATA STORE NEEDS
POWERFUL INDEXING MODELS
FAST MULTI-ATTRIBUTE SEARCHES
EASY TO MAINTAIN
60M+ QUERIES per day
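A multi-attribute search of this kind could look roughly like the sketch below. The field names (`age`, `religion`, `smokes`) and the `build_match_query` helper are assumptions made up for illustration; the slides do not show the real schema. The code only builds a MongoDB filter document and a compound index specification as plain Python data, so it runs without a database.

```python
def build_match_query(criteria):
    """Translate user-defined criteria into a MongoDB filter document.

    The keys of `criteria` and the document field names are hypothetical.
    """
    query = {}
    if "age_range" in criteria:
        low, high = criteria["age_range"]
        query["age"] = {"$gte": low, "$lte": high}  # range condition
    if "religions" in criteria:
        query["religion"] = {"$in": list(criteria["religions"])}  # set membership
    if "smokes" in criteria:
        query["smokes"] = criteria["smokes"]  # exact match
    return query


# Compound index specification as (field, direction) pairs, 1 meaning ascending.
# Equality fields come before the range field.
INDEX_SPEC = [("smokes", 1), ("religion", 1), ("age", 1)]

query = build_match_query({"age_range": (28, 36), "smokes": False})
print(query)
```

With pymongo, a filter like this would be passed to `collection.find(query)` and the index list to `collection.create_index(INDEX_SPEC)`; the collection itself is assumed here.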
MONGODB WINS
AUTO SCALING
BUILT-IN SHARDING
AUTO BALANCING
MMS
VOLDEMORT?
THAT NAME SOUNDS FAMILIAR
VOLDEMORT
DATA STORE NEEDS
CRUD OPERATIONS
VARIED TRANSACTION SIZES
BILLION+ POTENTIAL MATCHES per day
VOLDEMORT WINS
AUTO REPLICATION
AUTO PARTITIONING
PLUGGABLE SERIALIZATION
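Automatic partitioning and replication in a distributed key-value store can be pictured with a simplified, Dynamo-style hash ring. The sketch below is not Voldemort's actual code: the function names, the MD5-based hash, and the partition-to-node layout are all assumptions chosen for illustration. A key hashes to a partition, and its replicas live on the next distinct nodes found while walking the ring.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def replica_nodes(key, partition_owner, replication_factor):
    """Walk the ring from the key's partition, collecting distinct owner nodes."""
    start = partition_for(key, len(partition_owner))
    nodes = []
    for step in range(len(partition_owner)):
        node = partition_owner[(start + step) % len(partition_owner)]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == replication_factor:
            break
    return nodes


# Hypothetical layout: 8 partitions spread over 3 nodes (ids 0, 1, 2).
owners = [0, 1, 2, 0, 1, 2, 0, 1]
replicas = replica_nodes("user:42", owners, replication_factor=2)
print(replicas)
```

Because the hash is deterministic, every client computes the same replica list for a key without asking a central coordinator.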
AFFINITY MATCHING
65 30
3000 miles
[Chart: Comm probability vs. Distance in Miles (log-scale axis, 0 to 4095)]
[Chart: Comm probability vs. Height difference in cm (axis -29 to 56); 4-8 in]
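The distance axis uses ticks of the form 2^k - 1 (0, 1, 3, 7, 15, ...), which suggests log-scale bucketing. A minimal sketch of how such a curve could be estimated follows, assuming observations arrive as (distance, communicated) pairs; the function names and the sample data are invented for illustration and are not eHarmony's numbers or method.

```python
from collections import defaultdict


def distance_bucket(miles):
    """Lower edge (0, 1, 3, 7, 15, ...) of the log-scale bucket containing `miles`."""
    k = (miles + 1).bit_length() - 1  # floor(log2(miles + 1)) for integer miles
    return (1 << k) - 1


def comm_probability_by_bucket(observations):
    """Estimate P(communication) per distance bucket.

    `observations` is an iterable of (distance_in_miles, communicated) pairs.
    """
    sent = defaultdict(int)
    total = defaultdict(int)
    for miles, communicated in observations:
        bucket = distance_bucket(miles)
        total[bucket] += 1
        sent[bucket] += int(communicated)
    return {bucket: sent[bucket] / total[bucket] for bucket in sorted(total)}


# Made-up sample data, for illustration only.
sample = [(0, True), (2, True), (2, False), (5, False), (900, False), (1000, True)]
probs = comm_probability_by_bucket(sample)
print(probs)
```

The same bucketing idea applies to the height-difference chart, with linear rather than logarithmic bins.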
WORDS TO USE
SOME INSIGHT
DATA NEEDS FOR AFFINITY
50M+ REGISTERED USERS
10^3 ATTRIBUTES
10^7 DAILY MATCHES
250M+ PHOTOS
4B+ QUESTIONNAIRES ANSWERED
COMMUNICATION AGGREGATES
EVENT LISTENER SERVICE
USER ACTIVITY SERVICE
~5 MS RESPONSE TIMES
10K EVENTS PER SECOND
USER SERVICE
HOURLY, DAILY TOTAL
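Rolling user events up into hourly and daily totals could be sketched as follows. The event shape (user id, event type, Unix timestamp) and the `aggregate_events` helper are assumptions for illustration; the slides do not describe the real event format or the services' internals.

```python
from collections import Counter
from datetime import datetime, timezone


def aggregate_events(events):
    """Count events per (user, type, hour) and per (user, type, day), in UTC.

    `events` is an iterable of (user_id, event_type, unix_timestamp) tuples.
    """
    hourly = Counter()
    daily = Counter()
    for user_id, event_type, ts in events:
        when = datetime.fromtimestamp(ts, tz=timezone.utc)
        hourly[(user_id, event_type, when.strftime("%Y-%m-%dT%H"))] += 1
        daily[(user_id, event_type, when.strftime("%Y-%m-%d"))] += 1
    return hourly, daily


# Made-up events, for illustration only.
events = [
    ("u1", "message_sent", 0),      # 1970-01-01 00:00 UTC
    ("u1", "message_sent", 1800),   # 1970-01-01 00:30 UTC
    ("u1", "message_sent", 3600),   # 1970-01-01 01:00 UTC
    ("u1", "message_sent", 86400),  # 1970-01-02 00:00 UTC
]
hourly, daily = aggregate_events(events)
print(daily)
```

Keeping precomputed counters like these is one way a lookup service can answer in a few milliseconds, since a read becomes a single key fetch rather than a scan over raw events.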
OFFLINE BATCH JOBS
USER SERVICE
MAP-SIDE JOINS (TB)
SCORING
1+GB Compressed Protocol Buffers
PAIRINGS SERVICE
750M Compressed Protocol Buffers
BILLION+ POTENTIAL MATCHES
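The idea behind a map-side join is that the smaller dataset is loaded into memory on every mapper, so the large dataset can be streamed and joined without a shuffle or reduce phase. The sketch below shows the technique in plain Python rather than as a Hadoop job; the record fields and the `map_side_join` name are hypothetical.

```python
def map_side_join(small_table, large_records, key):
    """Join a streamed large dataset against an in-memory lookup table.

    Records in `large_records` with no match in `small_table` are dropped
    (inner-join semantics).
    """
    lookup = {row[key]: row for row in small_table}  # held in memory per mapper
    for record in large_records:
        match = lookup.get(record[key])
        if match is not None:
            yield {**match, **record}


# Made-up rows, for illustration only.
users = [{"user_id": 1, "region": "west"}, {"user_id": 2, "region": "east"}]
pairings = [
    {"user_id": 1, "candidate_id": 7},
    {"user_id": 3, "candidate_id": 9},  # no matching user, so it is dropped
    {"user_id": 2, "candidate_id": 8},
]
joined = list(map_side_join(users, pairings, "user_id"))
print(joined)
```

The approach only works when one side of the join fits in each worker's memory; otherwise a reduce-side join is needed.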
AMAZON EMR
AWS DIRECT CONNECT
256 NODES
50TB STORAGE
IN-HOUSE SEAMICRO
DATA RETRIEVAL LATENCY
LOW OPERATIONAL COST
LOW POWER CONSUMPTION
PREDICTABLE COMPLETION TIMES
MODEL RETRAINING
distcp
Protocol Buffers from Offline Jobs
MATCH DISTRIBUTION
Delivering the right matches
at the right time to as many
people as possible across
the entire network
THANK YOU
QUESTIONS?
CREDITS:
Visual Elements From The Noun Project
http://thenounproject.com

AUA Data Science Meetup
