SlideShare a Scribd company logo
Dremel - Interactive Analysis of
               Web-Scale Dataset
                                    Paper Review

                                  Arinto Murdopo

                                 November 8, 2012


1 Motivation
The main motivation of the paper is the inability of existing big data infrastructure
(MapReduce-BigTable and Hadoop) to perform fast ad-hoc explorations/queries into
web-scale data-sets. In this context, ad-hoc explorations/queries mean on-the-fly queries,
which are issued by users when they need it. The execution time of ad-hoc queries is
expected to be fast so that users can interactively explore the datasets.
  Secondary motivation of the paper is the needs of more user-friendly query mechanism
to make data analytic in big data infrastructure easier and faster. Pig and Hive try to
solve this challenge by providing SQL-like query language in Hadoop, but they only solve
the usability issue and not the performance issue.
  Solving these issues allows faster, more efficient and more effective data analytic in big
data infrastructure which implies higher productivity for data scientists and engineers
who are working in big data analytic. Hence, these issues are interesting and important
to be solved.


2 Contributions
The main contribution of Dremel is high-performance technique in processing web-scale
data-set for ad-hoc query usage. The mechanism solves these following challenges:

  1. Inefficiency in storing data for ad-hoc query.
     Ad-hoc queries most of the time do not need all the available field/column in
     a table. Therefore the authors propose columnar data model that improve data
     retrieval performance. It was novel solution because well-known big data processing
     platforms (such as MapReduce on Hadoop) work at record-structure data model
     at that time.

  2. Performance and scalability challenge in processing the data.



                                           1
Multi-level serving tree model allows high parallelism in processing the data. It
      also allows optimization of query in each of the tree level. Query scheduling allows
      prioritization of the execution. Both techniques solve performance challenge in
      data processing.
      Scalability challenge is also solved since we can easily add more nodes to process
      more data in multi-level serving tree model. Separation of query scheduler from
      root server is good because it decouples query scheduling responsibility with job
      tracking (note that: job tracking is performed by root server). This scheduler
      model implies higher scalability when the number of leaf servers is huge.

  The minor contribution of this paper is the use of actual Google data set in the experi-
ment. This usage gives other researchers insight on the practical magnitude of web-scale
data-sets


3 Solutions
3.1 Columnar Data Model
As mentioned before, ad-hoc queries only need small subset of fields/columns in tables.
Record-structure data model introduces significant inefficiencies. The reason is in order
to retrieve the needed fields/columns, we need to read the whole record data, including
unnecessary fields/columns. To reduce these inefficiencies, columnar data model is intro-
duced. Columnar data model allows us to read only the needed fields/columns for ad hoc
queries. This model will reduce the data retrieval inefficiencies and increase the speed.
The authors also explain how to convert from record-structure data model to columnar
model and vice versa.

3.2 Multi-level Serving Tree
The authors use multi-level serving tree in columnar data processing. Typical multi-level
serving tree consists of a root server, several intermediate servers and many leaf servers.
There are two reasons behind multi-level serving tree:

  1. Characteristics of ad hoc queries where the result set size is small or medium. The
     overhead of processing these kinds of data in parallel is small.

  2. High degree of parallelism to process small or medium-size data.

3.3 Query Dispatcher
The authors propose a mechanism to regulate resource allocation for each leaf servers and
the mechanism is handled by module called query dispatcher. Query dispatcher works
in slot(number of processing unit available for execution) unit. It allocates appropriate
number of tablets into their respective slot. It deals with stragglers by moving the tablets
from slow straggler slots into new slots.



                                             2
3.4 SQL-like query
People with SQL background can easily use Dremel to perform data analytics in web-
scale datasets. Unlike Pig and Hive, Dremel does not convert SQL-like queries into
MapReduce jobs, therefore Dremel should have faster execution time compared to Pig
and Hive.


4 Strong Points
  1. Identification of the characteristics of ad-hoc queries data set.
     The authors correctly identify the main characteristic of data set returned from
     ad-hoc queries, which is: only small number of fields are used by ad hoc-queries.
     This finding allows the authors to develop columnar data model and use multi-level
     serving tree to process the columnar data model.
  2. Fast and lossless conversion between nested record structure model and columnar
     data model.
     Although columnar data model has been used in other related works, the fast and
     lossless conversion algorithm that the authors propose is novel and one of the key
     contributions of Dremel.
  3. Magnitude and variety of datasets for experiments.
     The magnitude of datasets is huge and practical. These magnitude and variety
     of data-sets increase the confirming power of Dremel solution and proof its high
     performance.


5 Weak Points
  1. Record-oriented data model can still outperform columnar data model. This is
     the main shortcoming of Dremel, however credit must be given on the authors
     since they do not hide this shortcoming and they provide some insight on this
     shortcoming.
  2. Performance analysis on the data model conversion is not discussed. They claim
     that the conversion is fast, but they do not support this argument using experiment
     data.
  3. "Cold" setting usage in Local Disk experiment. In Local Disk experiment, the au-
     thors mention that "all reported times are cold". Using cold setting in database
     or storage benchmarking is not recommended because the data is highly biased
     with disk access performance. When the database is "cold", query execution per-
     formance in the start of the experiment will highly depend on the disk speed. In
     the start of the experiment, most of the operations involve moving data from disk
     to OS cache, and the execution performance will be dominated by disk access.




                                           3

More Related Content

What's hot

개인화 추천은 어디로 가고 있는가?
개인화 추천은 어디로 가고 있는가?개인화 추천은 어디로 가고 있는가?
개인화 추천은 어디로 가고 있는가?
choi kyumin
 
人気番組との戦い! Javaシステムのパフォーマンスチューニング奮闘記
人気番組との戦い! Javaシステムのパフォーマンスチューニング奮闘記人気番組との戦い! Javaシステムのパフォーマンスチューニング奮闘記
人気番組との戦い! Javaシステムのパフォーマンスチューニング奮闘記
心 谷本
 
하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018
하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018
하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018
승호 박
 
[IGC 2016] 스튜디오 EIM 정사인- 실패하지 않는 게임 사운드 제작 접근법
[IGC 2016] 스튜디오 EIM 정사인- 실패하지 않는 게임 사운드 제작 접근법[IGC 2016] 스튜디오 EIM 정사인- 실패하지 않는 게임 사운드 제작 접근법
[IGC 2016] 스튜디오 EIM 정사인- 실패하지 않는 게임 사운드 제작 접근법
강 민우
 
Elasticsearch와 Python을 이용하여 맨땅에서 데이터 분석하기
Elasticsearch와 Python을 이용하여 맨땅에서 데이터 분석하기Elasticsearch와 Python을 이용하여 맨땅에서 데이터 분석하기
Elasticsearch와 Python을 이용하여 맨땅에서 데이터 분석하기
흥래 김
 
ソースコードの品質向上のための効果的で効率的なコードレビュー
ソースコードの品質向上のための効果的で効率的なコードレビューソースコードの品質向上のための効果的で効率的なコードレビュー
ソースコードの品質向上のための効果的で効率的なコードレビューMoriharu Ohzu
 
Redmineチケットによるプロジェクト火消し戦略!
Redmineチケットによるプロジェクト火消し戦略!Redmineチケットによるプロジェクト火消し戦略!
Redmineチケットによるプロジェクト火消し戦略!
TrinityT _
 
【Unite 2017 Tokyo】ゲームAI・ゲームデザインから考えるゲームの過去・現在・未来
【Unite 2017 Tokyo】ゲームAI・ゲームデザインから考えるゲームの過去・現在・未来【Unite 2017 Tokyo】ゲームAI・ゲームデザインから考えるゲームの過去・現在・未来
【Unite 2017 Tokyo】ゲームAI・ゲームデザインから考えるゲームの過去・現在・未来
Unity Technologies Japan K.K.
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
Sadayuki Furuhashi
 
NDC21_게임테스트자동화5년의기록_NCSOFT_김종원.pdf
NDC21_게임테스트자동화5년의기록_NCSOFT_김종원.pdfNDC21_게임테스트자동화5년의기록_NCSOFT_김종원.pdf
NDC21_게임테스트자동화5년의기록_NCSOFT_김종원.pdf
Jongwon Kim
 
Fun QA_재미지수_GENII
Fun QA_재미지수_GENIIFun QA_재미지수_GENII
Fun QA_재미지수_GENII
원철 정
 
쩌는게임기획서 이렇게 쓴다
쩌는게임기획서 이렇게 쓴다쩌는게임기획서 이렇게 쓴다
쩌는게임기획서 이렇게 쓴다
Jinho Jung
 
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기 [데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
choi kyumin
 
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
Juhong Park
 
중니어의 고뇌: 1인분 개발자, 다음을 찾아서
중니어의 고뇌: 1인분 개발자, 다음을 찾아서중니어의 고뇌: 1인분 개발자, 다음을 찾아서
중니어의 고뇌: 1인분 개발자, 다음을 찾아서
Yurim Jin
 
PGroonga – Make PostgreSQL fast full text search platform for all languages!
PGroonga – Make PostgreSQL fast full text search platform for all languages!PGroonga – Make PostgreSQL fast full text search platform for all languages!
PGroonga – Make PostgreSQL fast full text search platform for all languages!
Kouhei Sutou
 
(독서광) 제품의 탄생
(독서광) 제품의 탄생(독서광) 제품의 탄생
(독서광) 제품의 탄생
Jay Park
 
Git Flowを運用するために
Git Flowを運用するためにGit Flowを運用するために
Git Flowを運用するために
Shun Tsunoda
 
Squad health-check-model v2-fr
Squad health-check-model v2-frSquad health-check-model v2-fr
Squad health-check-model v2-fr
Fabrice Aimetti
 
로그 기깔나게 잘 디자인하는 법
로그 기깔나게 잘 디자인하는 법로그 기깔나게 잘 디자인하는 법
로그 기깔나게 잘 디자인하는 법
Jeongsang Baek
 

What's hot (20)

개인화 추천은 어디로 가고 있는가?
개인화 추천은 어디로 가고 있는가?개인화 추천은 어디로 가고 있는가?
개인화 추천은 어디로 가고 있는가?
 
人気番組との戦い! Javaシステムのパフォーマンスチューニング奮闘記
人気番組との戦い! Javaシステムのパフォーマンスチューニング奮闘記人気番組との戦い! Javaシステムのパフォーマンスチューニング奮闘記
人気番組との戦い! Javaシステムのパフォーマンスチューニング奮闘記
 
하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018
하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018
하이퍼커넥트에서 자동 광고 측정 서비스 구현하기 - PyCon Korea 2018
 
[IGC 2016] 스튜디오 EIM 정사인- 실패하지 않는 게임 사운드 제작 접근법
[IGC 2016] 스튜디오 EIM 정사인- 실패하지 않는 게임 사운드 제작 접근법[IGC 2016] 스튜디오 EIM 정사인- 실패하지 않는 게임 사운드 제작 접근법
[IGC 2016] 스튜디오 EIM 정사인- 실패하지 않는 게임 사운드 제작 접근법
 
Elasticsearch와 Python을 이용하여 맨땅에서 데이터 분석하기
Elasticsearch와 Python을 이용하여 맨땅에서 데이터 분석하기Elasticsearch와 Python을 이용하여 맨땅에서 데이터 분석하기
Elasticsearch와 Python을 이용하여 맨땅에서 데이터 분석하기
 
ソースコードの品質向上のための効果的で効率的なコードレビュー
ソースコードの品質向上のための効果的で効率的なコードレビューソースコードの品質向上のための効果的で効率的なコードレビュー
ソースコードの品質向上のための効果的で効率的なコードレビュー
 
Redmineチケットによるプロジェクト火消し戦略!
Redmineチケットによるプロジェクト火消し戦略!Redmineチケットによるプロジェクト火消し戦略!
Redmineチケットによるプロジェクト火消し戦略!
 
【Unite 2017 Tokyo】ゲームAI・ゲームデザインから考えるゲームの過去・現在・未来
【Unite 2017 Tokyo】ゲームAI・ゲームデザインから考えるゲームの過去・現在・未来【Unite 2017 Tokyo】ゲームAI・ゲームデザインから考えるゲームの過去・現在・未来
【Unite 2017 Tokyo】ゲームAI・ゲームデザインから考えるゲームの過去・現在・未来
 
Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理Digdagによる大規模データ処理の自動化とエラー処理
Digdagによる大規模データ処理の自動化とエラー処理
 
NDC21_게임테스트자동화5년의기록_NCSOFT_김종원.pdf
NDC21_게임테스트자동화5년의기록_NCSOFT_김종원.pdfNDC21_게임테스트자동화5년의기록_NCSOFT_김종원.pdf
NDC21_게임테스트자동화5년의기록_NCSOFT_김종원.pdf
 
Fun QA_재미지수_GENII
Fun QA_재미지수_GENIIFun QA_재미지수_GENII
Fun QA_재미지수_GENII
 
쩌는게임기획서 이렇게 쓴다
쩌는게임기획서 이렇게 쓴다쩌는게임기획서 이렇게 쓴다
쩌는게임기획서 이렇게 쓴다
 
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기 [데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
[데이터야놀자2107] 강남 출근길에 판교/정자역에 내릴 사람 예측하기
 
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
[KAIST 채용설명회] 데이터 엔지니어는 무슨 일을 하나요?
 
중니어의 고뇌: 1인분 개발자, 다음을 찾아서
중니어의 고뇌: 1인분 개발자, 다음을 찾아서중니어의 고뇌: 1인분 개발자, 다음을 찾아서
중니어의 고뇌: 1인분 개발자, 다음을 찾아서
 
PGroonga – Make PostgreSQL fast full text search platform for all languages!
PGroonga – Make PostgreSQL fast full text search platform for all languages!PGroonga – Make PostgreSQL fast full text search platform for all languages!
PGroonga – Make PostgreSQL fast full text search platform for all languages!
 
(독서광) 제품의 탄생
(독서광) 제품의 탄생(독서광) 제품의 탄생
(독서광) 제품의 탄생
 
Git Flowを運用するために
Git Flowを運用するためにGit Flowを運用するために
Git Flowを運用するために
 
Squad health-check-model v2-fr
Squad health-check-model v2-frSquad health-check-model v2-fr
Squad health-check-model v2-fr
 
로그 기깔나게 잘 디자인하는 법
로그 기깔나게 잘 디자인하는 법로그 기깔나게 잘 디자인하는 법
로그 기깔나게 잘 디자인하는 법
 

Viewers also liked

Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
MapR Technologies
 
Uso correto de epi´s abafadores
Uso correto de epi´s   abafadoresUso correto de epi´s   abafadores
Uso correto de epi´s abafadores
Paulo Carvalho
 
Practica 2 luis ivan cruz val.
Practica 2 luis ivan cruz val.Practica 2 luis ivan cruz val.
Practica 2 luis ivan cruz val.persi-10
 
Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services
Arinto Murdopo
 
Netcare csi kelvin's talk aug 2015
Netcare csi kelvin's talk aug 2015Netcare csi kelvin's talk aug 2015
Netcare csi kelvin's talk aug 2015
Kelvin Glen
 
Arviointi ja palaute 2011
Arviointi ja palaute 2011Arviointi ja palaute 2011
Arviointi ja palaute 2011
Marko Havu
 
Distributed Computing - What, why, how..
Distributed Computing - What, why, how..Distributed Computing - What, why, how..
Distributed Computing - What, why, how..
Arinto Murdopo
 
Pechakucha
PechakuchaPechakucha
Pechakucha
cmcgrupo8
 
The counting system for small animals in japanese
The counting system for small animals in japaneseThe counting system for small animals in japanese
The counting system for small animals in japanese
CheyanneStotlar
 
Pankki 2.0-hankkeen esittely
Pankki 2.0-hankkeen esittelyPankki 2.0-hankkeen esittely
Pankki 2.0-hankkeen esittelyPankki2
 
Queens Parh Rangers AD410 น.ส.ฐิติมา ประเสริฐชัย เลขที่8
Queens Parh Rangers AD410 น.ส.ฐิติมา  ประเสริฐชัย เลขที่8Queens Parh Rangers AD410 น.ส.ฐิติมา  ประเสริฐชัย เลขที่8
Queens Parh Rangers AD410 น.ส.ฐิติมา ประเสริฐชัย เลขที่8yaying-yingg
 
Architecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArchitecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity Fabric
Arinto Murdopo
 
Cultura mites
Cultura mitesCultura mites
Cultura mitesComalat1D
 
153 test plan
153 test plan153 test plan
153 test plan< <
 
Moodboards eda
Moodboards edaMoodboards eda
Moodboards edaedaozdemir
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible Attacks
Arinto Murdopo
 
Maailmassa on parempia pankkeja
Maailmassa on parempia pankkejaMaailmassa on parempia pankkeja
Maailmassa on parempia pankkeja
Pankki2
 

Viewers also liked (20)

Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data SetsApache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets
 
Uso correto de epi´s abafadores
Uso correto de epi´s   abafadoresUso correto de epi´s   abafadores
Uso correto de epi´s abafadores
 
Practica 2 luis ivan cruz val.
Practica 2 luis ivan cruz val.Practica 2 luis ivan cruz val.
Practica 2 luis ivan cruz val.
 
Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services Intelligent Placement of Datacenter for Internet Services
Intelligent Placement of Datacenter for Internet Services
 
Netcare csi kelvin's talk aug 2015
Netcare csi kelvin's talk aug 2015Netcare csi kelvin's talk aug 2015
Netcare csi kelvin's talk aug 2015
 
 
UX homework4
UX homework4UX homework4
UX homework4
 
Arviointi ja palaute 2011
Arviointi ja palaute 2011Arviointi ja palaute 2011
Arviointi ja palaute 2011
 
Distributed Computing - What, why, how..
Distributed Computing - What, why, how..Distributed Computing - What, why, how..
Distributed Computing - What, why, how..
 
Pechakucha
PechakuchaPechakucha
Pechakucha
 
The counting system for small animals in japanese
The counting system for small animals in japaneseThe counting system for small animals in japanese
The counting system for small animals in japanese
 
Sam houston chess team
Sam houston chess teamSam houston chess team
Sam houston chess team
 
Pankki 2.0-hankkeen esittely
Pankki 2.0-hankkeen esittelyPankki 2.0-hankkeen esittely
Pankki 2.0-hankkeen esittely
 
Queens Parh Rangers AD410 น.ส.ฐิติมา ประเสริฐชัย เลขที่8
Queens Parh Rangers AD410 น.ส.ฐิติมา  ประเสริฐชัย เลขที่8Queens Parh Rangers AD410 น.ส.ฐิติมา  ประเสริฐชัย เลขที่8
Queens Parh Rangers AD410 น.ส.ฐิติมา ประเสริฐชัย เลขที่8
 
Architecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity FabricArchitecting a Cloud-Scale Identity Fabric
Architecting a Cloud-Scale Identity Fabric
 
Cultura mites
Cultura mitesCultura mites
Cultura mites
 
153 test plan
153 test plan153 test plan
153 test plan
 
Moodboards eda
Moodboards edaMoodboards eda
Moodboards eda
 
Quantum Cryptography and Possible Attacks
Quantum Cryptography and Possible AttacksQuantum Cryptography and Possible Attacks
Quantum Cryptography and Possible Attacks
 
Maailmassa on parempia pankkeja
Maailmassa on parempia pankkejaMaailmassa on parempia pankkeja
Maailmassa on parempia pankkeja
 

Similar to Dremel Paper Review

De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
CitiusTech
 
Orca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big DataOrca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big DataEMC
 
What is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of databaseWhat is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of database
Alireza Kamrani
 
Data science unit2
Data science unit2Data science unit2
Data science unit2
varshakumar21
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performance
ijcsa
 
Query optimization in oodbms identifying subquery for query management
Query optimization in oodbms identifying subquery for query managementQuery optimization in oodbms identifying subquery for query management
Query optimization in oodbms identifying subquery for query management
ijdms
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbAlexander Decker
 
data science chapter-4,5,6
data science chapter-4,5,6data science chapter-4,5,6
data science chapter-4,5,6
varshakumar21
 
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dipayan Dev
 
Context sensitive indexes for performance optimization of sql queries in mult...
Context sensitive indexes for performance optimization of sql queries in mult...Context sensitive indexes for performance optimization of sql queries in mult...
Context sensitive indexes for performance optimization of sql queries in mult...
avinash varma sagi
 
QUERY OPTIMIZATION FOR BIG DATA ANALYTICS
QUERY OPTIMIZATION FOR BIG DATA ANALYTICSQUERY OPTIMIZATION FOR BIG DATA ANALYTICS
QUERY OPTIMIZATION FOR BIG DATA ANALYTICS
ijcsit
 
Query Optimization for Big Data Analytics
Query Optimization for Big Data AnalyticsQuery Optimization for Big Data Analytics
Query Optimization for Big Data Analytics
AIRCC Publishing Corporation
 
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATIONA BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
cscpconf
 
IJSRED-V2I3P84
IJSRED-V2I3P84IJSRED-V2I3P84
IJSRED-V2I3P84
IJSRED
 
Data management in cloud study of existing systems and future opportunities
Data management in cloud study of existing systems and future opportunitiesData management in cloud study of existing systems and future opportunities
Data management in cloud study of existing systems and future opportunitiesEditor Jacotech
 
Data warehousing change in a challenging environment
Data warehousing change in a challenging environmentData warehousing change in a challenging environment
Data warehousing change in a challenging environmentDavid Walker
 
Context-Sensitive Indexes for Performance Optimization of SQL Queries in Mult...
Context-Sensitive Indexes for Performance Optimization of SQL Queries in Mult...Context-Sensitive Indexes for Performance Optimization of SQL Queries in Mult...
Context-Sensitive Indexes for Performance Optimization of SQL Queries in Mult...Arjun Sirohi
 

Similar to Dremel Paper Review (20)

De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
De-duplicated Refined Zone in Healthcare Data Lake Using Big Data Processing ...
 
Orca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big DataOrca: A Modular Query Optimizer Architecture for Big Data
Orca: A Modular Query Optimizer Architecture for Big Data
 
What is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of databaseWhat is Scalability and How can affect on overall system performance of database
What is Scalability and How can affect on overall system performance of database
 
Data science unit2
Data science unit2Data science unit2
Data science unit2
 
An experimental evaluation of performance
An experimental evaluation of performanceAn experimental evaluation of performance
An experimental evaluation of performance
 
Query optimization in oodbms identifying subquery for query management
Query optimization in oodbms identifying subquery for query managementQuery optimization in oodbms identifying subquery for query management
Query optimization in oodbms identifying subquery for query management
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
A survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo dbA survey on data mining and analysis in hadoop and mongo db
A survey on data mining and analysis in hadoop and mongo db
 
data science chapter-4,5,6
data science chapter-4,5,6data science chapter-4,5,6
data science chapter-4,5,6
 
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
Dr.Hadoop- an infinite scalable metadata management for Hadoop-How the baby e...
 
Context sensitive indexes for performance optimization of sql queries in mult...
Context sensitive indexes for performance optimization of sql queries in mult...Context sensitive indexes for performance optimization of sql queries in mult...
Context sensitive indexes for performance optimization of sql queries in mult...
 
QUERY OPTIMIZATION FOR BIG DATA ANALYTICS
QUERY OPTIMIZATION FOR BIG DATA ANALYTICSQUERY OPTIMIZATION FOR BIG DATA ANALYTICS
QUERY OPTIMIZATION FOR BIG DATA ANALYTICS
 
Query Optimization for Big Data Analytics
Query Optimization for Big Data AnalyticsQuery Optimization for Big Data Analytics
Query Optimization for Big Data Analytics
 
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATIONA BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
A BRIEF REVIEW ALONG WITH A NEW PROPOSED APPROACH OF DATA DE DUPLICATION
 
IJSRED-V2I3P84
IJSRED-V2I3P84IJSRED-V2I3P84
IJSRED-V2I3P84
 
Data management in cloud study of existing systems and future opportunities
Data management in cloud study of existing systems and future opportunitiesData management in cloud study of existing systems and future opportunities
Data management in cloud study of existing systems and future opportunities
 
Facade
FacadeFacade
Facade
 
Data warehousing change in a challenging environment
Data warehousing change in a challenging environmentData warehousing change in a challenging environment
Data warehousing change in a challenging environment
 
disertation
disertationdisertation
disertation
 
Context-Sensitive Indexes for Performance Optimization of SQL Queries in Mult...
Context-Sensitive Indexes for Performance Optimization of SQL Queries in Mult...Context-Sensitive Indexes for Performance Optimization of SQL Queries in Mult...
Context-Sensitive Indexes for Performance Optimization of SQL Queries in Mult...
 

More from Arinto Murdopo

Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
Arinto Murdopo
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
Arinto Murdopo
 
Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN
Arinto Murdopo
 
High Availability in YARN
High Availability in YARNHigh Availability in YARN
High Availability in YARN
Arinto Murdopo
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
Arinto Murdopo
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
Arinto Murdopo
 
Quantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideQuantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slide
Arinto Murdopo
 
Parallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIParallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPI
Arinto Murdopo
 
Megastore - ID2220 Presentation
Megastore - ID2220 PresentationMegastore - ID2220 Presentation
Megastore - ID2220 Presentation
Arinto Murdopo
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event Scalability
Arinto Murdopo
 
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideLarge Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
Arinto Murdopo
 
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsLarge-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
Arinto Murdopo
 
Rise of Network Virtualization
Rise of Network VirtualizationRise of Network Virtualization
Rise of Network Virtualization
Arinto Murdopo
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System Design
Arinto Murdopo
 
Distributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingDistributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer Computing
Arinto Murdopo
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
Arinto Murdopo
 
Why File Sharing is Dangerous?
Why File Sharing is Dangerous?Why File Sharing is Dangerous?
Why File Sharing is Dangerous?
Arinto Murdopo
 
Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?
Arinto Murdopo
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
Arinto Murdopo
 

More from Arinto Murdopo (19)

Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN Next Generation Hadoop: High Availability for YARN
Next Generation Hadoop: High Availability for YARN
 
High Availability in YARN
High Availability in YARNHigh Availability in YARN
High Availability in YARN
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...An Integer Programming Representation for Data Center Power-Aware Management ...
An Integer Programming Representation for Data Center Power-Aware Management ...
 
Quantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slideQuantum Cryptography and Possible Attacks-slide
Quantum Cryptography and Possible Attacks-slide
 
Parallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPIParallelization of Smith-Waterman Algorithm using MPI
Parallelization of Smith-Waterman Algorithm using MPI
 
Megastore - ID2220 Presentation
Megastore - ID2220 PresentationMegastore - ID2220 Presentation
Megastore - ID2220 Presentation
 
Flume Event Scalability
Flume Event ScalabilityFlume Event Scalability
Flume Event Scalability
 
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - SlideLarge Scale Distributed Storage Systems in Volunteer Computing - Slide
Large Scale Distributed Storage Systems in Volunteer Computing - Slide
 
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing SystemsLarge-Scale Decentralized Storage Systems for Volunter Computing Systems
Large-Scale Decentralized Storage Systems for Volunter Computing Systems
 
Rise of Network Virtualization
Rise of Network VirtualizationRise of Network Virtualization
Rise of Network Virtualization
 
Consistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System DesignConsistency Tradeoffs in Modern Distributed Database System Design
Consistency Tradeoffs in Modern Distributed Database System Design
 
Distributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer ComputingDistributed Storage System for Volunteer Computing
Distributed Storage System for Volunteer Computing
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 
Why File Sharing is Dangerous?
Why File Sharing is Dangerous?Why File Sharing is Dangerous?
Why File Sharing is Dangerous?
 
Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?Why Use “REST” Architecture for Web Services?
Why Use “REST” Architecture for Web Services?
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
 

Recently uploaded

The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
Delapenabediema
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
EugeneSaldivar
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
Special education needs
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
Jheel Barad
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
beazzy04
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
Atul Kumar Singh
 
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Po-Chuan Chen
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
RaedMohamed3
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
Nguyen Thanh Tu Collection
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
JosvitaDsouza2
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
heathfieldcps1
 
678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf
CarlosHernanMontoyab2
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
SACHIN R KONDAGURI
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
vaibhavrinwa19
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
kaushalkr1407
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
DeeptiGupta154
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
Peter Windle
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
siemaillard
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
Jisc
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
EduSkills OECD
 

Recently uploaded (20)

The Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official PublicationThe Challenger.pdf DNHS Official Publication
The Challenger.pdf DNHS Official Publication
 
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...TESDA TM1 REVIEWER  FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
TESDA TM1 REVIEWER FOR NATIONAL ASSESSMENT WRITTEN AND ORAL QUESTIONS WITH A...
 
special B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdfspecial B.ed 2nd year old paper_20240531.pdf
special B.ed 2nd year old paper_20240531.pdf
 
Instructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptxInstructions for Submissions thorugh G- Classroom.pptx
Instructions for Submissions thorugh G- Classroom.pptx
 
Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345Sha'Carri Richardson Presentation 202345
Sha'Carri Richardson Presentation 202345
 
Language Across the Curriculm LAC B.Ed.
Language Across the  Curriculm LAC B.Ed.Language Across the  Curriculm LAC B.Ed.
Language Across the Curriculm LAC B.Ed.
 
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdfAdversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
Adversarial Attention Modeling for Multi-dimensional Emotion Regression.pdf
 
Palestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptxPalestine last event orientationfvgnh .pptx
Palestine last event orientationfvgnh .pptx
 
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 
1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx1.4 modern child centered education - mahatma gandhi-2.pptx
1.4 modern child centered education - mahatma gandhi-2.pptx
 
The basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptxThe basics of sentences session 5pptx.pptx
The basics of sentences session 5pptx.pptx
 
678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf678020731-Sumas-y-Restas-Para-Colorear.pdf
678020731-Sumas-y-Restas-Para-Colorear.pdf
 
"Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe..."Protectable subject matters, Protection in biotechnology, Protection of othe...
"Protectable subject matters, Protection in biotechnology, Protection of othe...
 
Acetabularia Information For Class 9 .docx
Acetabularia Information For Class 9  .docxAcetabularia Information For Class 9  .docx
Acetabularia Information For Class 9 .docx
 
The Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdfThe Roman Empire A Historical Colossus.pdf
The Roman Empire A Historical Colossus.pdf
 
Overview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with MechanismOverview on Edible Vaccine: Pros & Cons with Mechanism
Overview on Edible Vaccine: Pros & Cons with Mechanism
 
A Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in EducationA Strategic Approach: GenAI in Education
A Strategic Approach: GenAI in Education
 
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
 
Supporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptxSupporting (UKRI) OA monographs at Salford.pptx
Supporting (UKRI) OA monographs at Salford.pptx
 
Francesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptxFrancesca Gottschalk - How can education support child empowerment.pptx
Francesca Gottschalk - How can education support child empowerment.pptx
 

Dremel Paper Review

  • 1. Dremel - Interactive Analysis of Web-Scale Dataset Paper Review Arinto Murdopo November 8, 2012 1 Motivation The main motivation of the paper is the inability of existing big data infrastructure (MapReduce-BigTable and Hadoop) to perform fast ad-hoc explorations/queries into web-scale data-sets. In this context, ad-hoc explorations/queries mean on-the-fly queries, which are issued by users when they need it. The execution time of ad-hoc queries is expected to be fast so that users can interactively explore the datasets. Secondary motivation of the paper is the needs of more user-friendly query mechanism to make data analytic in big data infrastructure easier and faster. Pig and Hive try to solve this challenge by providing SQL-like query language in Hadoop, but they only solve the usability issue and not the performance issue. Solving these issues allows faster, more efficient and more effective data analytic in big data infrastructure which implies higher productivity for data scientists and engineers who are working in big data analytic. Hence, these issues are interesting and important to be solved. 2 Contributions The main contribution of Dremel is high-performance technique in processing web-scale data-set for ad-hoc query usage. The mechanism solves these following challenges: 1. Inefficiency in storing data for ad-hoc query. Ad-hoc queries most of the time do not need all the available field/column in a table. Therefore the authors propose columnar data model that improve data retrieval performance. It was novel solution because well-known big data processing platforms (such as MapReduce on Hadoop) work at record-structure data model at that time. 2. Performance and scalability challenge in processing the data. 1
  • 2. Multi-level serving tree model allows high parallelism in processing the data. It also allows optimization of query in each of the tree level. Query scheduling allows prioritization of the execution. Both techniques solve performance challenge in data processing. Scalability challenge is also solved since we can easily add more nodes to process more data in multi-level serving tree model. Separation of query scheduler from root server is good because it decouples query scheduling responsibility with job tracking (note that: job tracking is performed by root server). This scheduler model implies higher scalability when the number of leaf servers is huge. The minor contribution of this paper is the use of actual Google data set in the experi- ment. This usage gives other researchers insight on the practical magnitude of web-scale data-sets 3 Solutions 3.1 Columnar Data Model As mentioned before, ad-hoc queries only need small subset of fields/columns in tables. Record-structure data model introduces significant inefficiencies. The reason is in order to retrieve the needed fields/columns, we need to read the whole record data, including unnecessary fields/columns. To reduce these inefficiencies, columnar data model is intro- duced. Columnar data model allows us to read only the needed fields/columns for ad hoc queries. This model will reduce the data retrieval inefficiencies and increase the speed. The authors also explain how to convert from record-structure data model to columnar model and vice versa. 3.2 Multi-level Serving Tree The authors use multi-level serving tree in columnar data processing. Typical multi-level serving tree consists of a root server, several intermediate servers and many leaf servers. There are two reasons behind multi-level serving tree: 1. Characteristics of ad hoc queries where the result set size is small or medium. The overhead of processing these kinds of data in parallel is small. 2. High degree of parallelism to process small or medium-size data. 3.3 Query Dispatcher The authors propose a mechanism to regulate resource allocation for each leaf servers and the mechanism is handled by module called query dispatcher. Query dispatcher works in slot(number of processing unit available for execution) unit. It allocates appropriate number of tablets into their respective slot. It deals with stragglers by moving the tablets from slow straggler slots into new slots. 2
  • 3. 3.4 SQL-like query People with SQL background can easily use Dremel to perform data analytics in web- scale datasets. Unlike Pig and Hive, Dremel does not convert SQL-like queries into MapReduce jobs, therefore Dremel should have faster execution time compared to Pig and Hive. 4 Strong Points 1. Identification of the characteristics of ad-hoc queries data set. The authors correctly identify the main characteristic of data set returned from ad-hoc queries, which is: only small number of fields are used by ad hoc-queries. This finding allows the authors to develop columnar data model and use multi-level serving tree to process the columnar data model. 2. Fast and lossless conversion between nested record structure model and columnar data model. Although columnar data model has been used in other related works, the fast and lossless conversion algorithm that the authors propose is novel and one of the key contributions of Dremel. 3. Magnitude and variety of datasets for experiments. The magnitude of datasets is huge and practical. These magnitude and variety of data-sets increase the confirming power of Dremel solution and proof its high performance. 5 Weak Points 1. Record-oriented data model can still outperform columnar data model. This is the main shortcoming of Dremel, however credit must be given on the authors since they do not hide this shortcoming and they provide some insight on this shortcoming. 2. Performance analysis on the data model conversion is not discussed. They claim that the conversion is fast, but they do not support this argument using experiment data. 3. "Cold" setting usage in Local Disk experiment. In Local Disk experiment, the au- thors mention that "all reported times are cold". Using cold setting in database or storage benchmarking is not recommended because the data is highly biased with disk access performance. When the database is "cold", query execution per- formance in the start of the experiment will highly depend on the disk speed. In the start of the experiment, most of the operations involve moving data from disk to OS cache, and the execution performance will be dominated by disk access. 3