Interactive Big data analysis
Viet-Trung Tran
1	
  
MapReduce wordcount
2	
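The word count job named on this slide can be sketched in plain Python as a map phase, a shuffle that groups values by key, and a reduce phase. This is only an in-memory illustration of the programming model, not a distributed implementation; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` and the two input lines are made up for the example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    # All map output is produced before any reduce runs, which mirrors
    # the barrier between the map and reduce phases.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data analysis", "interactive big data"])
print(counts)
```

Even in this toy form the whole input passes through every phase before an answer appears, which is the batch behaviour the following slides contrast with interactive analysis.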
  
MR – batch processing
•  Long running job
– latency between running the job and getting the answer
•  Lot of computations
•  Specific language
3	
  
Example Problem
•  Jane works as an analyst at an e-commerce company
•  How does she figure out good targeting segments for the next marketing campaign?
•  She has some ideas and lots of data
[Figure labels: User profiles, Transaction information, Access logs]
4	
  
Solving the problems?
All compiled to Map Reduce jobs
5	
  
Dremel: interactive analysis of web-scale datasets
Melnik et al., Google Inc.
[VLDB 2010]
6	
  
What is Dremel?
•  Near real-time interactive analysis (instead of batch processing). SQL-like query language
–  Trillion-record, multi-terabyte datasets
•  Nested data with a columnar storage representation
•  Serving tree: multi-level execution trees for query processing
•  Interoperates "in place" with GFS, Bigtable
•  The engine behind Google BigQuery
•  Builds on ideas from web search and parallel DBMSs.
7	
  
Why call it Dremel
•  Brand of power tools that primarily rely on their speed as opposed to torque
•  Data analysis tool that uses speed instead of raw power
8	
  
Widely used inside Google
•  Analysis of crawled web documents
•  Tracking install data for applications on Android Market
•  Crash reporting for Google products
•  OCR results from Google Books
•  Spam analysis
•  Debugging of map tiles on Google Maps
•  Tablet migrations in managed Bigtable instances
•  Results of tests run on Google's distributed build system
•  Disk I/O statistics for hundreds of thousands of disks
•  Resource monitoring for jobs run in Google's data centers
•  Symbols and dependencies in Google's codebase
9	
  
Records vs. columns
[Figure: nested fields A to E of records r1 and r2, laid out record by record vs. column by column]
Column-oriented: read less, cheaper decompression
Challenge: preserve structure, reconstruct from a subset of fields
10	
  
Columnar format
•  Values in a column stored next to one another
– Better compression
– Range-map: save min-max
•  Only access columns participating in query
•  Aggregations can be done without decoding
11	
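The points on this slide can be illustrated with a small Python sketch: values of one column are stored together, a range map keeps the min and max of each block so blocks can be skipped, and a count can be computed directly on run-length-encoded data. The toy table, the helper names (`build_range_map`, `blocks_to_scan`, `run_length_encode`) and the block size are assumptions made for the illustration, not part of Dremel or any real file format.

```python
# Hypothetical toy table; the column names and values are invented.
rows = [
    {"user": "a", "country": "vn", "amount": 10},
    {"user": "b", "country": "us", "amount": 25},
    {"user": "c", "country": "vn", "amount": 5},
    {"user": "d", "country": "fr", "amount": 40},
]

# Columnar layout: the values of each column are stored next to one another.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def build_range_map(values, block_size):
    """Save (min, max) for each fixed-size block of a column."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]


def blocks_to_scan(range_map, lower_bound):
    """Indexes of blocks that may hold a value >= lower_bound."""
    return [i for i, (_, high) in enumerate(range_map) if high >= lower_bound]


def run_length_encode(values):
    """Encode consecutive equal values as [value, run_length] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


amount_ranges = build_range_map(columns["amount"], block_size=2)
candidates = blocks_to_scan(amount_ranges, lower_bound=30)

# Only the "amount" column is read; "user" and "country" are not touched.
total = sum(columns["amount"])

# Counting per country works on the runs, without expanding them again.
country_runs = run_length_encode(sorted(columns["country"]))
country_counts = {}
for value, length in country_runs:
    country_counts[value] = country_counts.get(value, 0) + length
```

With a block size of 2, a filter such as `amount >= 30` only needs the second block, because the first block's maximum is 25.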
  
Nested data model
Schema (multiplicity in brackets):
message Document {
  required int64 DocId;          [1,1]
  optional group Links {
    repeated int64 Backward;     [0,*]
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;   [0,1]
    }
    optional string Url;
  }
}
Record r1:
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'
Record r2:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80
Name
  Url: 'http://C'
12	
  
Column-striped representation
DocId
value r d
10 0 0
20 0 0
Name.Url
value r d
http://A 0 2
http://B 1 2
NULL 1 1
http://C 0 2
Name.Language.Code
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL 0 1
Name.Language.Country
value r d
us 0 3
NULL 2 2
NULL 1 1
gb 1 3
NULL 0 1
Links.Forward
value r d
20 0 2
40 1 2
60 1 2
80 0 2
Links.Backward
value r d
NULL 0 1
10 0 2
30 1 2
13	
  
Repetition and definition levels
Record r1:
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'
Record r2:
DocId: 20
Links
  Backward: 10
  Backward: 30
  Forward: 80
Name
  Url: 'http://C'
Name.Language.Code
value r d
en-us 0 2
en 2 2
NULL 1 1
en-gb 1 2
NULL 0 1
r: At what repeated field in the field's path the value has repeated
d: How many fields in the path that could be undefined (opt. or rep.) are actually present
Path Name.Language.Code: Name is repeated (r=1), Language is repeated (r=2), Code is non-repeating
en-us: record (r=0) has repeated
en: Language (r=2) has repeated
14	
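A minimal Python sketch of how the (value, r, d) triples for `Name.Language.Code` could be derived from records r1 and r2. It is specialized by hand to this one path and to records held as Python dicts; it is not the general column-striping algorithm from the Dremel paper, and the function name `stripe_code` is invented for the illustration.

```python
r1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {
    "DocId": 20,
    "Links": {"Backward": [10, 30], "Forward": [80]},
    "Name": [{"Url": "http://C"}],
}


def stripe_code(record):
    """Return (value, r, d) triples for the path Name.Language.Code."""
    names = record.get("Name", [])
    if not names:
        # Nothing on the path is present: NULL at definition level 0.
        return [(None, 0, 0)]
    triples = []
    for i, name in enumerate(names):
        # The first Name of a record starts the record (r=0); any later
        # Name means the repeated field Name has repeated (r=1).
        r_name = 0 if i == 0 else 1
        languages = name.get("Language", [])
        if not languages:
            # Only Name is present on the path, so d=1 and the value is NULL.
            triples.append((None, r_name, 1))
            continue
        for j, language in enumerate(languages):
            # A second or later Language inside the same Name has r=2.
            r = r_name if j == 0 else 2
            # Code is required, so it adds nothing to d: Name + Language = 2.
            triples.append((language["Code"], r, 2))
    return triples


column = stripe_code(r1) + stripe_code(r2)
for value, r, d in column:
    print(value, r, d)
```

The printed triples can be compared row by row with the Name.Language.Code table on the slide, with `None` standing in for NULL.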
  
Record assembly FSM
message Document {
  required int64 DocId;          [1,1]
  optional group Links {
    repeated int64 Backward;     [0,*]
    repeated int64 Forward;
  }
  repeated group Name {
    repeated group Language {
      required string Code;
      optional string Country;   [0,1]
    }
    optional string Url;
  }
}
[Figure: finite state machine whose states are the fields DocId, Links.Backward, Links.Forward, Name.Language.Code, Name.Language.Country and Name.Url; transitions labeled with repetition levels]
15	
  
Record assembly FSM: example
[Figure: finite state machine whose states are the fields DocId, Links.Backward, Links.Forward, Name.Language.Code, Name.Language.Country and Name.Url; transitions labeled with repetition levels]
Assembled record:
DocId: 10
Links
  Forward: 20
  Forward: 40
  Forward: 60
Name
  Language
    Code: 'en-us'
    Country: 'us'
  Language
    Code: 'en'
  Url: 'http://A'
Name
  Url: 'http://B'
Name
  Language
    Code: 'en-gb'
    Country: 'gb'
16	
  
Reading two fields
[Figure: FSM with two states; DocId moves to Name.Language.Country on 0, Name.Language.Country loops on 1,2 and exits on 0]
s1:
DocId: 10
Name
  Language
    Country: 'us'
  Language
Name
Name
  Language
    Country: 'gb'
s2:
DocId: 20
Name
Structure of parent fields is preserved.
Useful for queries like /Name[3]/Language[1]/Country
17	
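A rough Python sketch of assembling the partial records s1 and s2 from just the two columns `DocId` and `Name.Language.Country`, using the repetition level to decide where a value attaches and the definition level to decide how much of the path exists. It is hand-written for this pair of fields under the schema shown earlier; the function name `assemble` and the dict-based record layout are assumptions for the illustration, not the generated FSM from the paper.

```python
# (value, r, d) triples copied from the column-striped representation.
doc_ids = [(10, 0, 0), (20, 0, 0)]
countries = [
    ("us", 0, 3),
    (None, 2, 2),
    (None, 1, 1),
    ("gb", 1, 3),
    (None, 0, 1),
]


def assemble(doc_id_column, country_column):
    """Rebuild partial records containing only DocId and Country."""
    records = []
    for value, r, d in country_column:
        if r == 0:
            # r=0 starts a new record; DocId has exactly one value per record.
            doc_id = doc_id_column[len(records)][0]
            records.append({"DocId": doc_id, "Name": []})
        record = records[-1]
        if d >= 1 and r <= 1:
            # r=0 or r=1 opens a new Name, provided Name is defined (d >= 1).
            record["Name"].append({"Language": []})
        if d >= 2:
            # Language is defined; r=2 adds it to the current Name.
            language = {}
            if d == 3:
                # Country itself is present only at the full definition level.
                language["Country"] = value
            record["Name"][-1]["Language"].append(language)
    return records


records = assemble(doc_ids, countries)
third_name_country = records[0]["Name"][2]["Language"][0]["Country"]
```

Because empty `Name` and `Language` groups are kept, a positional lookup like /Name[3]/Language[1]/Country still lands on 'gb' in s1 (index 2 and 0 in the zero-based Python lists).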
  
Query processing
•  Optimized for select-project-aggregate
– Very common class of interactive queries
– Single scan
– Within-record and cross-record aggregation
•  Approximations: count(distinct), top-k
•  Joins, temp tables, UDFs/TVFs, etc.
18	
  
SQL dialect for nested data
SELECT DocId AS Id,
  COUNT(Name.Language.Code) WITHIN Name AS Cnt,
  Name.Url + ',' + Name.Language.Code AS Str
FROM t
WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
Output table t1:
Id: 10
Name
  Cnt: 2
  Language
    Str: 'http://A,en-us'
    Str: 'http://A,en'
Name
  Cnt: 0
Output schema:
message QueryResult {
  required int64 Id;
  repeated group Name {
    optional uint64 Cnt;
    repeated group Language {
      optional string Str;
    }
  }
}
No record assembly during query processing
19	
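To make the semantics of the query on this slide concrete, here is a small Python sketch that evaluates it against record r1 held as a dict. It walks the assembled record for readability, whereas the slide notes that Dremel evaluates the query without record assembly; the function name `query` and the dict layout are assumptions for the illustration.

```python
import re

r1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {"DocId": 20, "Name": [{"Url": "http://C"}]}


def query(record):
    """Evaluate the slide's SELECT ... WITHIN Name query on one record."""
    if not record["DocId"] < 20:
        return None  # the whole record is filtered out by DocId < 20
    result = {"Id": record["DocId"], "Name": []}
    for name in record.get("Name", []):
        url = name.get("Url")
        if url is None or not re.search(r"^http", url):
            continue  # this Name is pruned by REGEXP(Name.Url, '^http')
        languages = name.get("Language", [])
        result["Name"].append({
            # COUNT(...) WITHIN Name: count Code values inside this Name only.
            "Cnt": sum(1 for language in languages if "Code" in language),
            "Language": [
                {"Str": url + "," + language["Code"]} for language in languages
            ],
        })
    return result


t1 = query(r1)
```

The third `Name` of r1 has no `Url`, so it does not appear in the output, and the second `Name` has no `Language`, which gives `Cnt: 0`.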
  
Serving tree
[Figure: a client sends queries to the root server, which fans out through intermediate servers to leaf servers (with local storage) on top of the storage layer (e.g., GFS)]
[Figure: histogram of response times]
20	
  
Multi-level serving tree
•  Parallelizes scheduling and aggregation
– Reduced fan-in
– Divide/conquer
– Better network utilization
•  Fault tolerance
21	
  
Example: count()
Level 0 (root server):
  SELECT A, COUNT(B) FROM T GROUP BY A
  T = {/gfs/1, /gfs/2, …, /gfs/100000}
  rewritten as
  SELECT A, SUM(c) FROM (R11 UNION ALL R110) GROUP BY A
Level 1 (intermediate servers):
  R11 = SELECT A, COUNT(B) AS c FROM T11 GROUP BY A
  T11 = {/gfs/1, …, /gfs/10000}
  R12 = SELECT A, COUNT(B) AS c FROM T12 GROUP BY A
  T12 = {/gfs/10001, …, /gfs/20000}
  . . .
Level 3 (leaf servers, data access ops):
  SELECT A, COUNT(B) AS c FROM T31 GROUP BY A
  T31 = {/gfs/1}
  . . .
22	
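The rewrite on this slide, where leaves compute partial counts and upper levels sum them, can be sketched in Python as follows. The tablets are tiny in-memory lists and the tree is simulated with a loop; `leaf_count`, `merge`, `serve` and the `fan_out` parameter are hypothetical names for the illustration and do not correspond to a real Dremel API.

```python
from collections import Counter


def leaf_count(tablet):
    """Leaf server: SELECT A, COUNT(B) AS c FROM tablet GROUP BY A."""
    partial = Counter()
    for row in tablet:
        # COUNT(B) ignores NULLs, but the group itself must still appear.
        partial[row["A"]] += 0 if row.get("B") is None else 1
    return partial


def merge(partials):
    """Intermediate or root server: SELECT A, SUM(c) ... GROUP BY A."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # adds the per-group counts together
    return total


def serve(tablets, fan_out):
    """Aggregate bottom-up, merging at most fan_out children per server."""
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    if not tablets:
        return {}
    partials = [leaf_count(tablet) for tablet in tablets]
    while len(partials) > 1:
        partials = [
            merge(partials[i:i + fan_out])
            for i in range(0, len(partials), fan_out)
        ]
    return dict(partials[0])


tablets = [
    [{"A": "x", "B": 1}, {"A": "y", "B": None}],
    [{"A": "x", "B": 2}],
    [{"A": "y", "B": 3}, {"A": "x", "B": 4}],
]
result = serve(tablets, fan_out=2)
```

COUNT decomposes into a sum of partial counts, so each level only forwards one small row per group instead of raw data, which is where the reduced fan-in and better network utilization come from.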
  
Experiments
Table name   Number of records   Size (unrepl., compressed)   Number of fields   Data center   Repl. factor
T1           85 billion          87 TB                        270                A             3×
T2           24 billion          13 TB                        530                A             3×
T3           4 billion           70 TB                        1200               A             3×
T4           1+ trillion         105 TB                       50                 B             3×
T5           1+ trillion         20 TB                        30                 B             2×
•  1 PB of real data (uncompressed, non-replicated)
•  100K-800K tablets per table
•  Experiments run during business hours
23	
  
Read from disk
[Figure: time (sec) vs. number of fields, reading from columns and from records]
From columns: (a) read + decompress, (b) assemble records, (c) parse as C++ objects
From records: (d) read + decompress, (e) parse as C++ objects
Table partition: 375 MB (compressed), 300K rows, 125 columns
2-4x overhead of using records
10x speedup using columnar storage
24	
  
MR and Dremel execution
Avg # of terms in txtField in 85 billion record table T1
Sawzall program ran on MR:
  num_recs: table sum of int;
  num_words: table sum of int;
  emit num_recs <- 1;
  emit num_words <- count_words(input.txtField);
Q1:
  SELECT SUM(count_words(txtField)) / COUNT(*)
  FROM T1
[Figure: execution time (sec) on 3000 nodes for MR-records, MR-columns and Dremel; data read: 87 TB, 0.5 TB and 0.5 TB respectively]
MR overheads: launch jobs, schedule 0.5M tasks, assemble records
25	
  
Impact of serving tree depth
[Figure: execution time (sec) of Q2 and Q3 with 2, 3 and 4 serving levels]
Q2 (returns 100s of records):
  SELECT country, SUM(item.amount) FROM T2
  GROUP BY country
Q3 (returns 1M records):
  SELECT domain, SUM(item.amount) FROM T2
  WHERE domain CONTAINS '.net'
  GROUP BY domain
40 billion nested items
26	
  
Scalability
[Figure: execution time (sec) vs. number of leaf servers, from 1000 to 4000]
Q5 on a trillion-row table T4:
  SELECT TOP(aid, 20), COUNT(*) FROM T4
27	
  
Interactive speed
[Figure: percentage of queries vs. execution time (sec), time axis from 1 to 1000]
Monthly query workload of one 3000-node Dremel instance
Most queries complete under 10 sec
28	
  
Observations
•  Possible to analyze large disk-resident datasets interactively on commodity hardware
–  1T records, 1000s of nodes
•  MR can benefit from columnar storage just like a parallel DBMS
–  But record assembly is expensive
–  Interactive SQL and MR can be complementary
•  Parallel DBMSes may benefit from serving tree architecture just like search engines
29	
  
Vs. MapReduce
•  Scheduling Model
–  Coarse resource model reduces hardware utilization
–  Acquisition of resources typically takes 100’s of millis to seconds
•  Barriers
–  Map completion required before shuffle/reduce commencement
–  All maps must complete before reduce can start
–  In chained jobs, one job must finish entirely before the next one can start
•  Persistence and Recoverability
–  Data is persisted to disk between each barrier
–  Serialization and deserialization are required between execution phases
30	
  
Apache Drill
31	
  
32	
  
33	
  
34	
  
Full SQL – ANSI SQL 2003
•  SQL-like is not enough
•  Fine integration with existing BI tools
– Tableau, SAP
– Standard ODBC/JDBC driver
35	
  
Working data
•  Flat files in DFS
– Complex data (Thrift, Avro, protobuf)
– Columnar data (Parquet, ORC)
– JSON
– CSV, TSV
•  NoSQL stores
– Document stores
– Sparse data
– Relational-like
36	
  
37	
  
Flexible schema
38	
  
Sample query
39	
  
40	
  
Nested data
•  Nested data as first class entity
– Similar to BigQuery
– No upfront flattening required
– JSON, BSON, Avro, Protocol Buffers
41	
  
Cross data source queries
•  Combine data from files, HBase, Hive in one single query
•  No central metadata definitions necessary
42	
  
High level architecture
•  Cluster of drillbits, one per node, designed to maximize data locality
•  Form a distributed query processing engine
•  ZooKeeper for cluster membership only
•  Hazelcast distributed cache for query plans, metadata, locality information
•  Columnar record organization
•  No dependency on other execution engines (MapReduce, Tez, Spark)
43	
  
Basic query flow
44	
  
Drillbit modules
•  SQL parser
•  Optimizer
•  Execution
•  Query execution
– source query: what
– logical plan: what
– physical plan: how
– execution plan: where
45	
  
46	
  
Optimistic execution
•  Short running query
– No checkpoints
– Rerun entire query in face of failure
•  No barriers
•  No persistence
47	
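The "no checkpoints, rerun the entire query on failure" policy can be sketched as a plain retry loop. This is only an illustration of the idea under simple assumptions: the names `run_optimistically` and `flaky_query` are invented, a `RuntimeError` stands in for a failed node or fragment, and nothing here reflects Drill's actual implementation.

```python
def run_optimistically(query, max_attempts=3):
    """Run query() with no checkpoints; on failure, rerun it from scratch."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            # No intermediate state is saved, so every attempt starts over.
            return query(), attempt
        except RuntimeError as error:  # stand-in for a node/fragment failure
            last_error = error
    raise RuntimeError(f"query failed after {max_attempts} attempts") from last_error


calls = {"count": 0}


def flaky_query():
    """Hypothetical short query that fails once, then succeeds."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("simulated worker failure")
    return [("vn", 15), ("us", 25)]


result, attempts = run_optimistically(flaky_query)
```

The trade-off is that a failure costs a full rerun, which is acceptable when queries are short and failures are rare, and it avoids paying for persistence on every successful run.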
  
Runtime compilation
48	
  
Roadmap
49	
  

More Related Content

What's hot

14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
rameswara reddy venkat
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
samthemonad
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
NAVER D2
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
Subhas Kumar Ghosh
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and R
Radek Maciaszek
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
Takahiro Inoue
 
Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章
moai kids
 
Scalding
ScaldingScalding
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientists
Lambda Tree
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
Hadoop User Group
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big Data
Dhafer Malouche
 
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)
moai kids
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)
Brian O'Neill
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
Cloudera, Inc.
 
Report
ReportReport
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
Mohamed Elsaka
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data Records
Workhorse Computing
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
Dr Ganesh Iyer
 
R + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterR + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop cluster
Jeffrey Breen
 
C07.heaps
C07.heapsC07.heaps

What's hot (20)

14 query processing-sorting
14 query processing-sorting14 query processing-sorting
14 query processing-sorting
 
Spark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with SparkSpark 4th Meetup Londond - Building a Product with Spark
Spark 4th Meetup Londond - Building a Product with Spark
 
(Julien le dem) parquet
(Julien le dem)   parquet(Julien le dem)   parquet
(Julien le dem) parquet
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 
Extending lifespan with Hadoop and R
Extending lifespan with Hadoop and RExtending lifespan with Hadoop and R
Extending lifespan with Hadoop and R
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章Hadoop本 輪読会 1章〜2章
Hadoop本 輪読会 1章〜2章
 
Scalding
ScaldingScalding
Scalding
 
Python for R developers and data scientists
Python for R developers and data scientistsPython for R developers and data scientists
Python for R developers and data scientists
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
R the unsung hero of Big Data
R the unsung hero of Big DataR the unsung hero of Big Data
R the unsung hero of Big Data
 
ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)ちょっとHadoopについて語ってみるか(仮題)
ちょっとHadoopについて語ってみるか(仮題)
 
Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)Ruby on Big Data (Cassandra + Hadoop)
Ruby on Big Data (Cassandra + Hadoop)
 
Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0Efficient Data Storage for Analytics with Apache Parquet 2.0
Efficient Data Storage for Analytics with Apache Parquet 2.0
 
Report
ReportReport
Report
 
Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
Perly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data RecordsPerly Parallel Processing of Fixed Width Data Records
Perly Parallel Processing of Fixed Width Data Records
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
R + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop clusterR + 15 minutes = Hadoop cluster
R + 15 minutes = Hadoop cluster
 
C07.heaps
C07.heapsC07.heaps
C07.heaps
 

Viewers also liked

Social media strategies for libraries poster
Social media strategies for libraries posterSocial media strategies for libraries poster
Social media strategies for libraries poster
Nataly Blas
 
Practica 2 quimica organica -espol
Practica 2  quimica organica -espolPractica 2  quimica organica -espol
Practica 2 quimica organica -espol
Lissy Rodriguez
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworks
Viet-Trung TRAN
 
The Rules - SGS
The Rules - SGSThe Rules - SGS
The Rules - SGS
Tania Kasongo
 
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Paul Brown
 
The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16
Sightlines
 
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Dave McClure
 
Balanceo de una ecuación química
Balanceo de una ecuación químicaBalanceo de una ecuación química
Balanceo de una ecuación química
dopamina mexico
 
teaching methods
teaching methods teaching methods
teaching methods
estefycoronel
 
Moving to the Right Side of Safety
Moving to the Right Side of SafetyMoving to the Right Side of Safety
Moving to the Right Side of Safety
SAMTRAC International
 
God Is Forgiving
God Is ForgivingGod Is Forgiving
God Is Forgiving
William Harris
 
xoxooo tkmmm
xoxooo tkmmmxoxooo tkmmm
xoxooo tkmmm
ceny2
 
Guia De Estudio Digestivo
Guia De Estudio DigestivoGuia De Estudio Digestivo
Guia De Estudio Digestivo
Luciana Yohai
 
Jobs consultant
Jobs consultantJobs consultant
Jobs consultant
Tenforce
 
Jvm mbeans jmxtran
Jvm mbeans jmxtranJvm mbeans jmxtran
Jvm mbeans jmxtran
adm_exoplatform
 
How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website. How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website.
Liquis Design
 
William Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of MillionsWilliam Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of Millions
Tric Park
 
Latin Dansları
Latin DanslarıLatin Dansları
Latin Dansları
Busrawien28
 
Charitable Giving and Happiness
Charitable Giving and HappinessCharitable Giving and Happiness
Charitable Giving and Happiness
Faircom New York
 
Torque
TorqueTorque
Torque
caitlinforan
 

Viewers also liked (20)

Social media strategies for libraries poster
Social media strategies for libraries posterSocial media strategies for libraries poster
Social media strategies for libraries poster
 
Practica 2 quimica organica -espol
Practica 2  quimica organica -espolPractica 2  quimica organica -espol
Practica 2 quimica organica -espol
 
Tachyon memory centric, fault tolerance storage for cluster framworks
Tachyon  memory centric, fault tolerance storage for cluster framworksTachyon  memory centric, fault tolerance storage for cluster framworks
Tachyon memory centric, fault tolerance storage for cluster framworks
 
The Rules - SGS
The Rules - SGSThe Rules - SGS
The Rules - SGS
 
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
Challenging our Notions of Learning: Understanding How Web 2.0 Technology Wor...
 
The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16The State of Facilities at Eastern Region Institutions JUNE16
The State of Facilities at Eastern Region Institutions JUNE16
 
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
Ultimate Platform Hotness Smackdown (Twitter, Facebook, iPhone, Native Web / ...
 
Balanceo de una ecuación química
Balanceo de una ecuación químicaBalanceo de una ecuación química
Balanceo de una ecuación química
 
teaching methods
teaching methods teaching methods
teaching methods
 
Moving to the Right Side of Safety
Moving to the Right Side of SafetyMoving to the Right Side of Safety
Moving to the Right Side of Safety
 
God Is Forgiving
God Is ForgivingGod Is Forgiving
God Is Forgiving
 
xoxooo tkmmm
xoxooo tkmmmxoxooo tkmmm
xoxooo tkmmm
 
Guia De Estudio Digestivo
Guia De Estudio DigestivoGuia De Estudio Digestivo
Guia De Estudio Digestivo
 
Jobs consultant
Jobs consultantJobs consultant
Jobs consultant
 
Jvm mbeans jmxtran
Jvm mbeans jmxtranJvm mbeans jmxtran
Jvm mbeans jmxtran
 
How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website. How to increase traffic to your WordPress website.
How to increase traffic to your WordPress website.
 
William Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of MillionsWilliam Gross Sues Pimco for Hundreds of Millions
William Gross Sues Pimco for Hundreds of Millions
 
Latin Dansları
Latin DanslarıLatin Dansları
Latin Dansları
 
Charitable Giving and Happiness
Charitable Giving and HappinessCharitable Giving and Happiness
Charitable Giving and Happiness
 
Torque
TorqueTorque
Torque
 

Similar to Interactive big data analytics

Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets
robertlz
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
Neville Li
 
Handout3o
Handout3oHandout3o
Handout3o
Shahbaz Sidhu
 
MongoDB 3.0
MongoDB 3.0 MongoDB 3.0
MongoDB 3.0
Victoria Malaya
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
Martin Dvorak
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
pramodbiligiri
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
Xiao Qin
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
Amazon Web Services
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Data Con LA
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
Collin Bennett
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
Francisco Pérez-Sorrosal
 
User biglm
User biglmUser biglm
User biglm
johnatan pladott
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
rantav
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
Cleverence Kombe
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data model
Andrey Lomakin
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
Amazon Web Services
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasets
Carl Lu
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
InfluxData
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
Mao Geng
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
DataWorks Summit
 

Similar to Interactive big data analytics (20)

Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets Dremel: Interactive Analysis of Web-Scale Datasets
Dremel: Interactive Analysis of Web-Scale Datasets
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Handout3o
Handout3oHandout3o
Handout3o
 
MongoDB 3.0
MongoDB 3.0 MongoDB 3.0
MongoDB 3.0
 
Google Cluster Innards
Google Cluster InnardsGoogle Cluster Innards
Google Cluster Innards
 
Hadoop Network Performance profile
Hadoop Network Performance profileHadoop Network Performance profile
Hadoop Network Performance profile
 
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop ClustersHDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters
 
(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive(DAT401) Amazon DynamoDB Deep Dive
(DAT401) Amazon DynamoDB Deep Dive
 
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
Big Data Day LA 2015 - Compiling DSLs for Diverse Execution Environments by Z...
 
Processing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive ComputingProcessing Big Data: An Introduction to Data Intensive Computing
Processing Big Data: An Introduction to Data Intensive Computing
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
User biglm
User biglmUser biglm
User biglm
 
Introduction To Map Reduce
Introduction To Map ReduceIntroduction To Map Reduce
Introduction To Map Reduce
 
Map reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clustersMap reduce - simplified data processing on large clusters
Map reduce - simplified data processing on large clusters
 
Apache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data modelApache Cassandra, part 1 – principles, data model
Apache Cassandra, part 1 – principles, data model
 
Data Collection and Storage
Data Collection and StorageData Collection and Storage
Data Collection and Storage
 
Dremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasetsDremel interactive analysis of web scale datasets
Dremel interactive analysis of web scale datasets
 
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
Optimizing InfluxDB Performance in the Real World by Dean Sheehan, Senior Dir...
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache TezYahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
Yahoo - Moving beyond running 100% of Apache Pig jobs on Apache Tez
 

 

 

 

 

Interactive big data analytics

  • 1. Interactive big data analysis (Viet-Trung Tran)
  • 3. MR: batch processing
      • Long-running jobs
          – Latency between running the job and getting the answer
      • Lots of computation
      • Specific language
  • 4. Example problem
      • Jane works as an analyst at an e-commerce company
      • How does she figure out good targeting segments for the next marketing campaign?
      • She has some ideas and lots of data: user profiles, transaction information, access logs
  • 5. Solving the problem? All compiled to MapReduce jobs
  • 6. Dremel: Interactive Analysis of Web-Scale Datasets. Melnik et al., Google Inc. [VLDB 2010]
  • 7. What is Dremel?
      • Near-real-time interactive analysis (instead of batch processing) with a SQL-like query language
          – Trillion-record, multi-terabyte datasets
      • Nested data with a columnar storage representation
      • Serving tree: multi-level execution trees for query processing
      • Interoperates "in place" with GFS and Bigtable
      • The engine behind Google BigQuery
      • Builds on ideas from web search and parallel DBMSs
  • 8. Why call it Dremel?
      • Dremel is a brand of power tools that rely primarily on speed as opposed to torque
      • A data analysis tool that uses speed instead of raw power
  • 9. Widely used inside Google
      • Analysis of crawled web documents
      • Tracking install data for applications on Android Market
      • Crash reporting for Google products
      • OCR results from Google Books
      • Spam analysis
      • Debugging of map tiles on Google Maps
      • Tablet migrations in managed Bigtable instances
      • Results of tests run on Google's distributed build system
      • Disk I/O statistics for hundreds of thousands of disks
      • Resource monitoring for jobs run in Google's data centers
      • Symbols and dependencies in Google's codebase
  • 10. Records vs. columns
      • [Figure: fields A, B, C, D, E of records r1 and r2, laid out record by record vs. column by column]
      • Columnar layout: read less, cheaper decompression
      • Challenge: preserve structure, reconstruct records from a subset of fields
  • 11. Columnar format
      • Values in a column are stored next to one another
          – Better compression
          – Range map: save min and max
      • Only the columns participating in a query are accessed
      • Aggregations can be done without decoding
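The range-map bullet can be illustrated with a minimal Python sketch. It assumes a hypothetical `ColumnChunk` type that keeps one column's values together with their min and max; the names are illustrative and are not taken from Dremel.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ColumnChunk:
    """One chunk of a single column (hypothetical type for illustration)."""
    values: List[int]  # assumed non-empty

    def __post_init__(self) -> None:
        # The "range map": remember min and max once, when the chunk is built.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def count_greater_than(chunks: List[ColumnChunk], threshold: int) -> int:
    """Count values > threshold, consulting min/max before scanning a chunk."""
    total = 0
    for chunk in chunks:
        if chunk.max_value <= threshold:
            continue  # no value can match: skip the chunk entirely
        if chunk.min_value > threshold:
            total += len(chunk.values)  # every value matches: no scan needed
            continue
        total += sum(1 for v in chunk.values if v > threshold)
    return total


chunks = [ColumnChunk([1, 2, 3]), ColumnChunk([10, 20, 30]), ColumnChunk([4, 50, 6])]
print(count_greater_than(chunks, 5))
```

Only the third chunk is actually scanned here; the first is skipped and the second is counted from its length alone, which is the sense in which an aggregation can avoid reading or decoding values.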
  • 12. Nested data model
      • Schema (multiplicity in brackets):
          message Document {
            required int64 DocId;          [1,1]
            optional group Links {
              repeated int64 Backward;     [0,*]
              repeated int64 Forward;
            }
            repeated group Name {
              repeated group Language {
                required string Code;
                optional string Country;   [0,1]
              }
              optional string Url;
            }
          }
      • Record r1:
          DocId: 10
          Links
            Forward: 20
            Forward: 40
            Forward: 60
          Name
            Language
              Code: 'en-us'
              Country: 'us'
            Language
              Code: 'en'
            Url: 'http://A'
          Name
            Url: 'http://B'
          Name
            Language
              Code: 'en-gb'
              Country: 'gb'
      • Record r2:
          DocId: 20
          Links
            Backward: 10
            Backward: 30
            Forward: 80
          Name
            Url: 'http://C'
  • 13. Column-striped representation
      • Each column is a list of value (r, d) entries, where r is the repetition level and d is the definition level:
          DocId:                  10 (0, 0); 20 (0, 0)
          Name.Url:               http://A (0, 2); http://B (1, 2); NULL (1, 1); http://C (0, 2)
          Name.Language.Code:     en-us (0, 2); en (2, 2); NULL (1, 1); en-gb (1, 2); NULL (0, 1)
          Name.Language.Country:  us (0, 3); NULL (2, 2); NULL (1, 1); gb (1, 3); NULL (0, 1)
          Links.Forward:          20 (0, 2); 40 (1, 2); 60 (1, 2); 80 (0, 2)
          Links.Backward:         NULL (0, 1); 10 (0, 2); 30 (1, 2)
  • 14. Repetition and definition levels
      • r: at what repeated field in the field's path the value has repeated
      • d: how many fields in the path that could be undefined (optional or repeated) are actually present
      • Name.Language.Code for records r1 and r2 (as on slide 12), as value (r, d):
          en-us (0, 2); en (2, 2); NULL (1, 1); en-gb (1, 2); NULL (0, 1)
          – r=0: the record has repeated (a new record starts)
          – r=1: Name has repeated
          – r=2: Language has repeated
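The two definitions above can be sketched in Python. This is a minimal illustration, not Dremel's actual column-striping algorithm: it assumes records are plain dicts and lists, and that a column is described by a hypothetical `path` of `(field name, kind)` pairs with kind being `"required"`, `"optional"`, or `"repeated"`.

```python
def stripe_column(node, path, r=0, d=0, depth=0):
    """Yield (value, r, d) entries of one column for a single record.

    r     -- repetition level to emit for the next value
    d     -- number of optional/repeated fields seen present so far
    depth -- number of repeated fields on the path so far
    """
    if not path:
        yield (node, r, d)  # reached the leaf value
        return
    (name, kind), rest = path[0], path[1:]
    if kind == "repeated":
        depth += 1
    if name not in node or node[name] == []:
        yield (None, r, d)  # field missing: emit NULL with the levels so far
        return
    children = node[name] if kind == "repeated" else [node[name]]
    if kind != "required":
        d += 1  # an optional or repeated field is actually present
    for i, child in enumerate(children):
        # The first child inherits r; later siblings repeat at this field's depth.
        yield from stripe_column(child, rest, r if i == 0 else depth, d, depth)


r1 = {
    "DocId": 10,
    "Links": {"Forward": [20, 40, 60]},
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {
    "DocId": 20,
    "Links": {"Backward": [10, 30], "Forward": [80]},
    "Name": [{"Url": "http://C"}],
}

CODE = [("Name", "repeated"), ("Language", "repeated"), ("Code", "required")]
for record in (r1, r2):
    for entry in stripe_column(record, CODE):
        print(entry)
```

Run on r1 and r2, the sketch emits the Name.Language.Code entries in the same order as the slide, with `None` standing in for NULL. Other columns can be tried by changing the path, for example `[("Links", "optional"), ("Forward", "repeated")]`.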
  • 15. Record assembly FSM
      • Schema: message Document, as on slide 12
      • [Figure: finite state machine over the states DocId, Links.Backward, Links.Forward, Name.Language.Code, Name.Language.Country, Name.Url]
      • Transitions are labeled with repetition levels
  • 16. Record assembly FSM: example
      • [Figure: the same FSM (transitions labeled with repetition levels), shown assembling record r1 (DocId: 10) from slide 12]
  • 17. Reading two fields
      • [Figure: FSM with only the two states DocId and Name.Language.Country]
      • Assembled records:
          s1:
            DocId: 10
            Name
              Language
                Country: 'us'
              Language
            Name
            Name
              Language
                Country: 'gb'
          s2:
            DocId: 20
            Name
      • Structure of parent fields is preserved; useful for queries like /Name[3]/Language[1]/Country
  • 18. Query processing
      • Optimized for select-project-aggregate
          – Very common class of interactive queries
          – Single scan
          – Within-record and cross-record aggregation
      • Approximations: count(distinct), top-k
      • Joins, temp tables, UDFs/TVFs, etc.
  • 19. SQL dialect for nested data
      • Query:
          SELECT DocId AS Id,
            COUNT(Name.Language.Code) WITHIN Name AS Cnt,
            Name.Url + ',' + Name.Language.Code AS Str
          FROM t
          WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
      • Output table (record t1):
          Id: 10
          Name
            Cnt: 2
            Language
              Str: 'http://A,en-us'
              Str: 'http://A,en'
          Name
            Cnt: 0
      • Output schema:
          message QueryResult {
            required int64 Id;
            repeated group Name {
              optional uint64 Cnt;
              repeated group Language {
                optional string Str;
              }
            }
          }
      • No record assembly during query processing
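To make the WITHIN semantics concrete, here is a Python sketch that evaluates the slide's query over fully assembled records. It is only a model of the result: Dremel itself works on the columns without assembling records, and the function name `run_query` and the dict-based record layout are assumptions made for illustration.

```python
import re


def run_query(record):
    """Model of the slide's query on one Document record held as a dict."""
    if not record["DocId"] < 20:
        return None  # the whole record is filtered out
    names = []
    for name in record.get("Name", []):
        url = name.get("Url")
        if url is None or not re.search(r"^http", url):
            continue  # WHERE drops Name entries whose Url is missing or does not match
        languages = name.get("Language", [])
        names.append({
            # COUNT(Name.Language.Code) WITHIN Name: one count per Name entry.
            "Cnt": len(languages),
            # Name.Url + ',' + Name.Language.Code: one string per Language entry.
            "Language": [{"Str": url + "," + lang["Code"]} for lang in languages],
        })
    return {"Id": record["DocId"], "Name": names}


r1 = {
    "DocId": 10,
    "Name": [
        {"Language": [{"Code": "en-us", "Country": "us"}, {"Code": "en"}],
         "Url": "http://A"},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb", "Country": "gb"}]},
    ],
}
r2 = {"DocId": 20, "Name": [{"Url": "http://C"}]}

print(run_query(r1))
print(run_query(r2))
```

For r1 the sketch yields two Name entries with counts 2 and 0, mirroring the output table t1; r2 is dropped because its DocId is not below 20.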
  • 20. Serving tree
      • Levels, top to bottom: client, root server, intermediate servers, leaf servers (with local storage), storage layer (e.g., GFS)
      • [Figure: histogram of response times]
  • 21. Multi-level serving tree
      • Parallelizes scheduling and aggregation
          – Reduced fan-in
          – Divide and conquer
          – Better network utilization
      • Fault tolerance
  • 22. Example: count()
      • Level 0 (root) receives:
          SELECT A, COUNT(B) FROM T GROUP BY A
          T = {/gfs/1, /gfs/2, …, /gfs/100000}
        and rewrites it as:
          SELECT A, SUM(c) FROM (R11 UNION ALL R110) GROUP BY A
      • Level 1 computes the partial results R11, R12, …:
          SELECT A, COUNT(B) AS c FROM T11 GROUP BY A
          T11 = {/gfs/1, …, /gfs/10000}
          SELECT A, COUNT(B) AS c FROM T12 GROUP BY A
          T12 = {/gfs/10001, …, /gfs/20000}
          . . .
      • Level 3 (data access ops):
          SELECT A, COUNT(B) AS c FROM T31 GROUP BY A
          T31 = {/gfs/1}
          . . .
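The rewrite above (COUNT at the leaves, SUM of partial counts higher up) can be sketched in Python. This is an in-process toy under stated assumptions: tablets are lists of row dicts, and `leaf_count`, `merge`, `serve`, and the `fanout` parameter are hypothetical names, not part of Dremel.

```python
from collections import Counter


def leaf_count(tablet):
    """Leaf server: SELECT A, COUNT(B) AS c FROM tablet GROUP BY A."""
    counts = Counter()
    for row in tablet:
        # COUNT(B) ignores NULLs, but the group still appears with count 0.
        counts[row["A"]] += int(row.get("B") is not None)
    return counts


def merge(partials):
    """Inner server: SELECT A, SUM(c) FROM (R1 UNION ALL ...) GROUP BY A."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds the partial counts
    return total


def serve(tablets, fanout=2):
    """Aggregate leaf results level by level until one result remains."""
    level = [leaf_count(t) for t in tablets]
    while len(level) > 1:
        level = [merge(level[i:i + fanout]) for i in range(0, len(level), fanout)]
    return level[0] if level else Counter()


tablets = [
    [{"A": "x", "B": 1}, {"A": "y", "B": None}],
    [{"A": "x", "B": 2}],
    [{"A": "y", "B": 3}, {"A": "x", "B": 4}],
]
print(dict(serve(tablets)))
```

Because each inner node only adds up small per-group counters, no server ever needs to see more than `fanout` partial results at once, which is the reduced fan-in that the serving tree is after.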
  • 23. Experiments
      Table name   Number of records   Size (unrepl., compressed)   Number of fields   Data center   Repl. factor
      T1           85 billion          87 TB                        270                A             3×
      T2           24 billion          13 TB                        530                A             3×
      T3           4 billion           70 TB                        1200               A             3×
      T4           1+ trillion         105 TB                       50                 B             3×
      T5           1+ trillion         20 TB                        30                 B             2×
      • 1 PB of real data (uncompressed, non-replicated)
      • 100K-800K tablets per table
      • Experiments run during business hours
  • 24. Read from disk
      • [Figure: time (sec) vs. number of fields read]
          – From columns: (a) read + decompress, (b) assemble records, (c) parse as C++ objects
          – From records: (d) read + decompress, (e) parse as C++ objects
      • Table partition: 375 MB (compressed), 300K rows, 125 columns
      • 2-4x overhead of using records
      • 10x speedup using columnar storage
  • 25. MR and Dremel execution
      • Task: average number of terms in txtField in the 85-billion-record table T1
      • Sawzall program run on MR:
          num_recs: table sum of int;
          num_words: table sum of int;
          emit num_recs <- 1;
          emit num_words <- count_words(input.txtField);
      • Dremel query Q1:
          SELECT SUM(count_words(txtField)) / COUNT(*) FROM T1
      • [Figure: execution time (sec) on 3000 nodes for MR-records (87 TB read), MR-columns (0.5 TB), and Dremel (0.5 TB)]
      • MR overheads: launch jobs, schedule 0.5M tasks, assemble records
  • 26. Impact of serving tree depth
      • Q2 (returns 100s of records):
          SELECT country, SUM(item.amount) FROM T2
          GROUP BY country
      • Q3 (returns 1M records):
          SELECT domain, SUM(item.amount) FROM T2
          WHERE domain CONTAINS '.net'
          GROUP BY domain
      • 40 billion nested items
      • [Figure: execution time (sec) of Q2 and Q3 with 2, 3, and 4 serving-tree levels]
  • 27. Scalability
      • Q5 on a trillion-row table T4:
          SELECT TOP(aid, 20), COUNT(*) FROM T4
      • [Figure: execution time (sec) vs. number of leaf servers]
  • 28. Interactive speed
      • [Figure: percentage of queries vs. execution time (sec)]
      • Monthly query workload of one 3000-node Dremel instance
      • Most queries complete in under 10 sec
  • 29. Observations
      • Possible to analyze large disk-resident datasets interactively on commodity hardware
          – 1T records, 1000s of nodes
      • MR can benefit from columnar storage just like a parallel DBMS
          – But record assembly is expensive
          – Interactive SQL and MR can be complementary
      • Parallel DBMSes may benefit from serving tree architecture just like search engines
  • 30. Vs. MapReduce
      • Scheduling model
          – Coarse resource model reduces hardware utilization
          – Acquiring resources typically takes hundreds of milliseconds to seconds
      • Barriers
          – Map completion is required before shuffle/reduce commences
          – All maps must complete before reduce can start
          – In chained jobs, one job must finish entirely before the next one can start
      • Persistence and recoverability
          – Data is persisted to disk between barriers
          – Serialization and deserialization are required between execution phases
  • 35. Full SQL: ANSI SQL 2003
      • SQL-like is not enough
      • Fine integration with existing BI tools
          – Tableau, SAP
          – Standard ODBC/JDBC driver
  • 36. Working data
      • Flat files in DFS
          – Complex data (Thrift, Avro, protobuf)
          – Columnar data (Parquet, ORC)
          – JSON
          – CSV, TSV
      • NoSQL stores
          – Document stores
          – Sparse data
          – Relational-like
  • 41. Nested data
      • Nested data as a first-class entity
          – Similar to BigQuery
          – No upfront flattening required
          – JSON, BSON, Avro, Protocol Buffers
  • 42. Cross-data-source queries
      • Combine data from files, HBase, and Hive in a single query
      • No central metadata definitions necessary
  • 43. High-level architecture
      • Cluster of drillbits, one per node, designed to maximize data locality
      • Together they form a distributed query processing engine
      • ZooKeeper for cluster membership only
      • Hazelcast distributed cache for query plans, metadata, and locality information
      • Columnar record organization
      • No dependency on other execution engines (MapReduce, Tez, Spark)
  • 45. Drillbit modules
      • SQL parser
      • Optimizer
      • Execution
      • Query execution
          – Source query: what
          – Logical plan: what
          – Physical plan: how
          – Execution plan: where
  • 47. Optimistic execution
      • Short-running queries
          – No checkpoints
          – Rerun the entire query in the face of failure
      • No barriers
      • No persistence
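The "no checkpoints, rerun the entire query" policy can be sketched in a few lines of Python. Everything here is hypothetical: `NodeFailure`, `run_optimistically`, and the retry limit are illustrative names and choices, not Drill APIs.

```python
class NodeFailure(Exception):
    """Hypothetical error raised when a worker is lost mid-query."""


def run_optimistically(run_query, max_attempts=3):
    """Run a short query with no checkpoints; on failure, restart it from scratch."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()  # no intermediate state is saved between attempts
        except NodeFailure:
            if attempt == max_attempts:
                raise  # give up after the last attempt


calls = {"n": 0}


def flaky_query():
    """Stand-in query that fails on its first run and succeeds afterwards."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise NodeFailure("worker lost")
    return [("us", 42)]


result = run_optimistically(flaky_query)
print(result, "after", calls["n"], "attempts")
```

The trade-off is the one the slide implies: for queries that finish in seconds, redoing all the work occasionally costs less than persisting intermediate results on every run.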