Apache Hive facilitates querying and managing large datasets residing in distributed storage. Backed by a very wide community, Hive has been extended to support multiple distributed storage systems, and it is now common practice for an organization to keep data in several of them. This presentation covers two important aspects of Apache Hive. The first is how Hive lets organizations run complex analytical queries across various storage systems and big data components; we recently added HiveKa to support Hive queries on Kafka, and will use it as an example. At Cloudera, we focus not only on providing solutions that help organizations answer bigger questions, but also on making sure those solutions are robust. The second aspect covers the advanced methods and technologies, such as random query generators, Docker, and benchmarks, that we use at Cloudera to make sure Hive is ready to find the right answers in the huge volume, high velocity, and wide variety of today's data.
The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes of data were created.
Hadoop = distributed computational framework + distributed storage system
Huge community of developers and users
Rich feature sets
Modular enough to plug in any computational framework: MapReduce, Spark, or the next cool engine
Storage handlers let Hive treat almost any external system as its storage layer
Built on InputFormat, OutputFormat, and SerDe
First introduced for HBase
HBase, JDBC, MongoDB, Google Spreadsheets, Solr, Elasticsearch, Kafka, etc.
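As a concrete illustration of the storage-handler mechanism, the HBase integration (the first handler mentioned above) is declared at table-creation time; the handler class supplies the InputFormat, OutputFormat, and SerDe. This is a minimal sketch using the documented Hive/HBase syntax; the table, column family, and HBase table names are made up for the example.

```sql
-- Hive table backed by HBase via a storage handler.
-- The handler class wires in the HBase-specific InputFormat/OutputFormat/SerDe.
CREATE TABLE hbase_backed_table (key INT, value STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "example_table");
```

Queries against `hbase_backed_table` then run as ordinary HiveQL, with the handler translating reads and writes into HBase operations.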
High throughput distributed messaging system
Scalable, resilient, and low-latency
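Exposing Kafka to Hive follows the same storage-handler pattern. The sketch below uses the Kafka storage handler shipped with recent Hive releases; whether HiveKa used this exact class name is an assumption, and the topic, broker address, and columns are illustrative only.

```sql
-- Hive external table over a Kafka topic via a storage handler (sketch).
-- Topic name, broker address, and schema are hypothetical.
CREATE EXTERNAL TABLE kafka_events (event_id STRING, payload STRING)
STORED BY 'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "events",
  "kafka.bootstrap.servers" = "localhost:9092"
);
```

With such a table in place, analysts can join streaming Kafka data against tables in HDFS or HBase from a single HiveQL query, which is the cross-storage capability this talk highlights.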