Apache Tajo: An Open Source Big Data Warehouse
(what’s new in recent releases)

HadoopSphere - Virtual Conclave 2015
Hyunsik Choi, Gruter Inc. (hschoi @ gruter.com)

Agenda
• Tajo Overview
• Milestones and 0.10 Features
• What’s Next

Tajo: A Big Data Warehouse System
• Apache Top-level project
• Distributed and scalable data warehouse system on various data sources (e.g., HDFS, S3, HBase, …)
• Low latency, and long running batch queries in a single system
• Features
  • ANSI SQL compliance
  • Mature SQL features
  • Partitioned table support
  • Java/Python UDF support
  • JDBC driver and Java-based asynchronous API
  • Read/Write support of CSV, JSON, RCFile, SequenceFile, Parquet, ORC
[Figure: architecture diagram; recoverable labels: Master Server, TajoMaster, Slave Server]

Common Scenarios
• Extraction, Transformation, Loading (ETL)
• Interactive BI/analytics on web-scale big data
• Data discovery/Exploratory analysis with R and existing SQL tools

Use Cases: Replacement of Commercial DW
• Example: Biggest telco company in South Korea
• Goal:
  • Replacement of slow ETL workloads on datasets of several TB
  • Generation of many daily reports on users’ behaviors
  • Ad-hoc analysis on terabyte-scale data sets
• Key Benefits of Tajo:
  • Simplification of DW ETL, OLAP, and Hadoop ETL into a unified system
  • Saved license costs compared with a commercial DW
  • Much lower cost and more data analysis within the same SLA

Use Cases: Data Discovery
• Example: Music streaming service (26 million users)
• Goal:
  • Analysis on purchase history for target marketing
• Benefits:
  • Query interactivity on large data sets
  • Ability to use existing BI visualization tools

When is Tajo the right choice?
• You want a unified system for batch and interactive queries on Hadoop, Amazon S3, or HBase.
• You want to mix a Hadoop-based DW with an RDBMS-based DW, or to replace an existing RDBMS DW.
• You want to use existing SQL tools on a Hadoop DW.
  • 192. Milestones
      • Release timeline: 0.8, 0.9, 0.10, 0.11
      • Themes, in timeline order: more features; SQL compatibility; stability; analytical functions; ecosystem expansion; more features
      • More features:
          • Python UDF
          • Nested schema
          • Tablespace support
          • Query federation
          • Better query scheduler
  • 193. Selected Features in 0.10
  • 194. HBase Storage Support
      • You can use SQL to access HBase tables.
      • Tajo supports HBase storage:
          • CREATE (EXTERNAL) / DROP / INSERT (OVERWRITE) / SELECT
          • Bulk insertion through direct HFile writing

      CREATE TABLE hbase_t1 (key TEXT, col1 TEXT, col2 INT)
      USING hbase
      WITH (
        'table' = 't1',
        'columns' = ':key,cf1:col1,cf2:col2',
        'hbase.zookeeper.quorum' = 'host1:2181,host2:2181'
      )
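The 'columns' option in the statement above pairs each table column, in order, with either the HBase row key (`:key`) or a `family:qualifier` cell. The following Python sketch only illustrates that pairing; `parse_hbase_columns` is a hypothetical helper written for this example and is not part of Tajo, which does its own parsing.

```python
def parse_hbase_columns(schema, mapping):
    """Pair table column names with HBase targets from a 'columns' string.

    Hypothetical helper for illustration only; not a Tajo API.
    """
    targets = mapping.split(",")
    if len(targets) != len(schema):
        raise ValueError("mapping must have one entry per table column")
    result = {}
    for column, target in zip(schema, targets):
        family, _, qualifier = target.partition(":")
        if family == "":
            # An entry with an empty family, such as ':key', is the row key.
            result[column] = ("rowkey", None)
        else:
            result[column] = (family, qualifier)
    return result


mapping = parse_hbase_columns(["key", "col1", "col2"], ":key,cf1:col1,cf2:col2")
for column, target in mapping.items():
    print(column, "->", target)
```

With the mapping from the slide, `key` resolves to the row key, `col1` to family `cf1`, and `col2` to family `cf2`.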
  • 195. Better AWS support
      • Optimized for S3 and EMR environments
      • Fixed many bugs related to S3
      • EMR bootstrap supported in the AWS Labs GitHub repo
      • A quick guide for Tajo on EMR
          • http://www.gruter.com/blog/setting-up-a-tajo-cluster-on-amazon-emr/
      • EMR bootstrap for Tajo on EMR
          • https://github.com/awslabs/emr-bootstrap-actions/tree/master/tajo
  • 196. Better SQL tool support via thin JDBC
      [Diagram: ETL tools, BI tools, and reporting tools connect through Tajo JDBC to the Tajo cluster, which sits on HDFS, HBase, S3, and Swift.]
  • 198. Improved Performance and Stability
      • Off-heap sort operator for ORDER BY (TAJO-907)
      • Hash shuffle I/O improvement (TAJO-374, TAJO-987)
      • Skewness handling of hash shuffle
      • Automatic choice of parallel degree at runtime
      • Lots of query optimizer improvements
      • Master HA added (TAJO-704)
      • More error-tolerant shuffle fetch (TAJO-789, TAJO-953)
  • 199. What's New in Tajo 0.11
  • 200. Nested data and JSON support
      • Nested data is becoming common
          • JSON, BSON, XML, Protocol Buffer, Avro, Parquet, …
      • Many web applications commonly use JSON.
      • MongoDB uses JSON documents by default.
      • Many HBase users also store JSON documents in a cell.
      • Flattening causes a lot of data and computation overhead.
      • Tajo 0.11 natively supports nested data types.
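To make the flattening overhead concrete, here is a small Python sketch. The record layout and field names (`user`, `address`, `purchases`) are invented for this example; the point is only that every flattened row repeats the parent fields.

```python
def flatten(record):
    """Expand one nested record into one flat row per purchase."""
    rows = []
    for purchase in record["purchases"]:
        rows.append({
            # Parent fields are copied into every output row.
            "user": record["user"],
            "city": record["address"]["city"],
            "item": purchase["item"],
            "price": purchase["price"],
        })
    return rows


record = {
    "user": "u1",
    "address": {"city": "Seoul"},
    "purchases": [
        {"item": "song-a", "price": 1},
        {"item": "song-b", "price": 2},
        {"item": "song-c", "price": 3},
    ],
}

flat_rows = flatten(record)
print(len(flat_rows), "flat rows produced from 1 nested record")
```

One nested record becomes three flat rows, each carrying its own copy of `user` and `city`, which is the duplication a native nested type avoids.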
  • 201. How to create a nested schema table
      • Use the 'RECORD' keyword to define a complex data type.
  • 202. Loose schema for self-describing formats
      • You can handle schema evolution with ALTER TABLE ... ADD COLUMN.
  • 203. How to retrieve nested fields
      [Figure: input data, table definition, and SQL]
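Retrieving a nested field amounts to walking a path of field names through the record. The Python sketch below mimics that lookup for a dotted path such as `name.first_name`; the sample row and the helper `get_nested` are made up for illustration and say nothing about how Tajo evaluates such expressions internally.

```python
import json


def get_nested(row, path):
    """Follow a dotted path through nested dicts; return None if a field is missing."""
    value = row
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            # A missing field is treated like SQL NULL in this sketch.
            return None
        value = value[part]
    return value


# One JSON document per line, as in a typical JSON data file (sample data is invented).
line = '{"title": "Hand of the King", "name": {"first_name": "Eddard", "last_name": "Stark"}}'
row = json.loads(line)

print(get_nested(row, "title"))
print(get_nested(row, "name.first_name"))
```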
  • 204. Query federation and Tablespace support
      • Query support across multiple data sources
          • You can perform joins or unions among tables on different systems.
      • Benefits:
          • Data offload from RDBMS to Hadoop and vice versa
          • A mixed use of existing RDBMS and Hadoop
          • Access to NoSQL and various storages through SQL
          • A unified interface for SQL tools
      [Diagram: Apache Tajo on top of HDFS, NoSQL, S3, and Swift]
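The idea of joining tables that live in different systems can be sketched in a few lines of Python. Everything here is a stand-in chosen for the example: an in-memory SQLite database plays the role of an existing RDBMS, a list of dicts plays the role of records stored in files, and `federated_join` is a hypothetical function, not a Tajo API.

```python
import sqlite3

# Source 1: an in-memory SQLite database stands in for an existing RDBMS.
rdbms = sqlite3.connect(":memory:")
rdbms.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
rdbms.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "alice"), (2, "bob")])

# Source 2: a list of dicts stands in for order records stored in files.
orders = [
    {"customer_id": 1, "amount": 30},
    {"customer_id": 1, "amount": 12},
    {"customer_id": 2, "amount": 7},
    {"customer_id": 9, "amount": 99},  # no matching customer; dropped by the inner join
]


def federated_join(connection, order_rows):
    """Inner-join customers from the database with orders from the file source."""
    # Build a hash table on the smaller side, then probe it with each order.
    names = dict(connection.execute("SELECT id, name FROM customers"))
    return [
        (names[order["customer_id"]], order["amount"])
        for order in order_rows
        if order["customer_id"] in names
    ]


joined = federated_join(rdbms, orders)
print(joined)
```

A federated engine does the same thing at scale: it reads each side through the appropriate storage connector and performs the join itself.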
  • 205. Datasets stored in various formats and storages
      [Figure: data formats (SequenceFile, RCFile, Protocol Buffer, ORC) and storage types]
  • 206. Tablespace
      • Tablespace
          • A registered storage space
          • A tablespace is identified by a unique URI.
          • Configuration and policy are shared by all tables in the same tablespace.
          • It allows users to reuse registered storages and their configuration.
  • 207. Tablespace Configuration
      [Figure: configuration example with callouts for the tablespace name and the tablespace URI]
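A tablespace registration boils down to a name mapped to a unique URI. The Python sketch below loads such a mapping and enforces the uniqueness rule. The JSON layout, the host names, and the helper `load_tablespaces` are assumptions made for this example; consult the Tajo documentation for the actual configuration file format.

```python
import json

# Assumed layout for illustration: each tablespace name maps to one URI.
config_text = """
{
  "spaces": {
    "warehouse": {"uri": "hdfs://namenode:8020/tajo/warehouse"},
    "hbase1": {"uri": "hbase:zk://host1:2181,host2:2181/"}
  }
}
"""


def load_tablespaces(text):
    """Return a name -> URI map, rejecting a URI registered under two names."""
    spaces = json.loads(text)["spaces"]
    owner_by_uri = {}
    for name, space in spaces.items():
        uri = space["uri"]
        if uri in owner_by_uri:
            raise ValueError(
                "URI %s is already registered as %s" % (uri, owner_by_uri[uri])
            )
        owner_by_uri[uri] = name
    return {name: space["uri"] for name, space in spaces.items()}


tablespaces = load_tablespaces(config_text)
print(tablespaces["warehouse"])
```

Tables created in the same tablespace can then look up the shared URI by name instead of repeating connection details.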
  • 208. Create Table on a specified Tablespace

      CREATE TABLE uptodate (key TEXT, …) TABLESPACE hbase1;

      CREATE TABLE archive (l_orderkey bigint, …) TABLESPACE warehouse
      USING text WITH ('text.delimiter' = '|');

      • hbase1 and warehouse are tablespace names; text is the format name.
  • 209. Operation Push Down

      SELECT X, SUM(Y) FROM table1 WHERE x 100 GROUP BY x

      • Filter, projection, or group-by can be pushed down into underlying storages (like RDBMS, HBase, Elasticsearch, …).
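The effect of push down can be sketched in Python with SQLite standing in for the underlying storage. The comparison operator in the slide's query did not survive, so this example assumes `x > 100`; the table contents and both function names are likewise invented for illustration.

```python
import sqlite3
from collections import defaultdict

# An in-memory SQLite database stands in for the underlying storage.
storage = sqlite3.connect(":memory:")
storage.execute("CREATE TABLE table1 (x INTEGER, y INTEGER)")
storage.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(50, 1), (150, 2), (150, 3), (200, 4)],
)


def without_pushdown(conn):
    """Fetch every row, then filter and aggregate in the query engine."""
    sums = defaultdict(int)
    for x, y in conn.execute("SELECT x, y FROM table1"):
        if x > 100:  # assumed predicate
            sums[x] += y
    return dict(sums)


def with_pushdown(conn):
    """Let the storage apply the filter, projection, and GROUP BY itself."""
    query = "SELECT x, SUM(y) FROM table1 WHERE x > 100 GROUP BY x"
    return dict(conn.execute(query))


print(without_pushdown(storage))
print(with_pushdown(storage))
```

Both functions return the same sums, but the pushed-down version transfers only the aggregated groups out of the storage instead of every row.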
  • 210. Current Status of Storages
      • Storages:
          • HDFS support
          • Amazon S3 and OpenStack Swift
          • HBase scanner and writer (HFile and Put modes)
          • JDBC-based scanner and writer (work in progress)
          • Kafka scanner (patch available)
          • Elasticsearch (patch available)
      • Data formats:
          • Text, JSON, RCFile, SequenceFile, Avro, Parquet, and ORC (patch available)
  • 211. Python UDF
      • Python UDF and UDAF are supported in Tajo
      • http://tajo.apache.org/docs/devel/functions/python.html

      # The decorator that declares the SQL return type comes from Tajo's helper module.
      from tajo_util import output_type

      @output_type('int4')
      def return_one():
        return 1

      @output_type('text')
      def helloworld():
        return 'Hello, World'

      @output_type('int4')
      def sum_py(a, b):
        return a + b
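Outside a Tajo runtime the `output_type` decorator is not available, so UDF bodies like the ones above cannot be imported and unit-tested as they are. One option, sketched below, is a local stand-in decorator. This stand-in is an assumption for testing purposes only; it simply records the declared type and does not reproduce whatever Tajo's own decorator does.

```python
def output_type(type_name):
    """Local stand-in decorator: record the declared SQL return type on the function."""
    def decorate(func):
        func.output_type = type_name
        return func
    return decorate


@output_type('int4')
def sum_py(a, b):
    return a + b


@output_type('text')
def helloworld():
    return 'Hello, World'


# The functions stay ordinary Python callables, so they can be checked locally.
print(sum_py(2, 3), sum_py.output_type)
print(helloworld(), helloworld.output_type)
```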
  • 212. Get Involved!
      • We are recruiting contributors!
      • General
          • http://tajo.apache.org
      • Getting Started
          • http://tajo.apache.org/docs/0.10.0/getting_started.html
      • Downloads
          • http://tajo.apache.org/downloads.html
      • Jira (issue tracker)
          • https://issues.apache.org/jira/browse/TAJO
      • Join the mailing lists
          • dev-subscribe@tajo.apache.org
          • issues-subscribe@tajo.apache.org