The Evolution of Data Analysis with Hadoop - StampedeCon 2014

At StampedeCon 2014, Tom Wheeler (Cloudera) presented "The Evolution of Data Analysis with Hadoop."

This session will lead the audience through the evolution of data analysis in Hadoop to illustrate its progression from the original low-level, batch-oriented MapReduce approach to today’s higher-level interactive tools that require very little technical knowledge. We’ll discuss Apache Crunch, Hive, Impala and Solr.

While the nature of this talk is somewhat technical, no prior knowledge of Hadoop or any specific programming language is required. Frequent live demonstrations of the tools discussed will emphasize that analyzing data in Hadoop can be as easy as using a relational database or Internet search engine.

Published in: Technology, Education

  1. The Evolution of Data Analysis with Hadoop
     Tom Wheeler | StampedeCon 2014
  2. About the Presentation…
     • What’s ahead
     • Defining Hadoop
     • Data Processing with MapReduce
     • Simplifying Development with Apache Crunch
     • Bringing MapReduce to Analysts with Apache Hive
     • Getting Results Faster with Cloudera Impala
     • Finding Data Made Easy with Apache Solr / Cloudera Search
     • Conclusion + Q&A
  3. Important Trends
     • Ubiquitous connectivity
     • We produce more data than ever
     • User-generated content
     • Lacks rigid structure
     • Inexpensive storage
     • Permanent retention
  4. Big Data Can Mean Big Opportunity
     • One tweet is an anecdote
     • But a million tweets can signal important trends
     • One person’s product review is an opinion
     • But a million reviews might reveal a design flaw
     • One person’s diagnosis is an isolated case
     • But a million medical records could lead to a cure
  5. What is Apache Hadoop?
     • Distributed data storage and processing
     • Scalable, flexible, and economical
     • Open source
     • Inspired by Google
     • Two main components
     • Hadoop Distributed File System (HDFS)
     • MapReduce
  6. Getting Data into HDFS
     • HDFS is distinct from your local filesystem
     [Diagram labels: Local Filesystem; Hadoop Distributed File System (HDFS); Local Filesystem]
  7. What is MapReduce?
     • MapReduce is a programming model
     • You supply two processing functions: Map and Reduce
     • Map: typically used to transform, parse, or filter data
     • Reduce: typically used to summarize results (optional)
     • MapReduce in Hadoop is batch-oriented
  8. Why MapReduce?
     • MapReduce simplifies parallel processing
     • Code is typically written in Java
     • Shields developers from complexity of distributed computing
     • No explicit synchronization, network sockets, or file I/O
     • Still, it is tedious to write MapReduce directly…
  9. But MapReduce is like Assembly Language…
     • MapReduce is powerful and scalable
     • But writing MapReduce code directly in Java can be tedious
     • Business logic typically comprises just a fraction of overall code
     • Many real-world computations involve a sequence of jobs
     • Chaining multiple MapReduce jobs increases the complexity
     • Apache Crunch is designed to address these problems
  10. What is Apache Crunch?
     • Apache Crunch is a library that simplifies parallel processing
     • Open-source implementation of Google’s internal library
     • Provides a high-level API targeted at Java developers
     • No detailed knowledge of MapReduce required
     • Faster and easier than writing MapReduce code directly
     • Retains the power and expressiveness of Java
  11. What is Apache Hive?
     • High-level data processing on Hadoop
     • Another alternative to writing MapReduce code
     • Queries data in HDFS using a SQL-like language
         SELECT customers.cust_id, SUM(cost) AS total
         FROM customers
         JOIN orders ON customers.cust_id = orders.cust_id
         GROUP BY customers.cust_id
         ORDER BY total DESC;
  12. Hive Data and Metadata
     • As with a database, you query one or more tables
     • Hive tables are just a façade for a directory of data in HDFS
     • Default file format is delimited text, but many others supported
     • Table structure and location are specified during creation
     • Metadata is stored in an RDBMS
     • Tables can be populated by loading data into HDFS directory
     [Diagram labels: Data in HDFS; mytable; Metastore]
  13. What is Cloudera Impala?
     • Massively parallel SQL engine for Hadoop
     • Supports ad hoc / interactive queries on data in HDFS
     • Uses custom execution engine instead of MapReduce
     • Query syntax virtually identical to HiveQL / SQL
     • Shares metadata with Hive
     • Much, much faster than Hive
     • Impala is 100% open source (Apache-licensed)
  14. Apache Solr (and Cloudera Search)
     • Apache Solr provides high-performance indexing and search
     • Mature platform with widespread deployment
     • Requires little technical skill for end users, yet still powerful
     • Cloudera integrates Solr to search data in HDFS
     • CDH offers scalability and reliability
     • Distributed data storage and indexing
     • Cloudera Search is open source, just like Apache Solr itself
  15. Conclusion
     • Thanks for having me!
     • Any questions?
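
The Map and Reduce roles described on slide 7 can be illustrated with a minimal single-process sketch in plain Python. This is not Hadoop code and uses no Hadoop API; the record format (`cust_id,cost`), the function names `map_fn`, `reduce_fn`, and `run_job`, and the sample data are all assumptions made for illustration. In Hadoop, the grouping step between the two functions is handled by the framework and spread across many machines.

```python
from collections import defaultdict


def map_fn(line):
    """Map: parse one raw record and emit (key, value) pairs.

    Malformed lines are filtered out by emitting nothing.
    """
    parts = line.split(",")
    if len(parts) != 2:
        return
    cust_id, cost = parts
    try:
        yield cust_id.strip(), float(cost)
    except ValueError:
        return


def reduce_fn(key, values):
    """Reduce: summarize all values that share one key."""
    return key, sum(values)


def run_job(lines):
    """Hypothetical driver: map every record, group by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)  # stands in for Hadoop's shuffle/sort
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


# Hypothetical input: customer id and order cost, plus one malformed line.
records = ["a,10.0", "b,5.5", "a,2.5", "garbage"]
totals = run_job(records)
print(totals)
```

Even in this toy form, the business logic is only the few lines inside `map_fn` and `reduce_fn`; the rest is plumbing, which is the tedium that the higher-level tools in the talk aim to remove.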

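
The Hive query on slide 11 is close to standard SQL, so its shape can be tried locally. The sketch below runs it against an in-memory SQLite database through Python's `sqlite3` module, as a stand-in for Hive rather than Hive itself; the table columns beyond `cust_id` and `cost`, and all sample rows, are assumptions made for illustration.

```python
import sqlite3

# In-memory SQLite database used as a local stand-in for Hive tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (cust_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, cust_id INTEGER, cost REAL);
    INSERT INTO customers VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO orders VALUES (100, 1, 20.0), (101, 2, 5.0), (102, 1, 7.5);
    """
)

# The query text from the slide: total order cost per customer, largest first.
query = """
SELECT customers.cust_id, SUM(cost) AS total
FROM customers
JOIN orders ON customers.cust_id = orders.cust_id
GROUP BY customers.cust_id
ORDER BY total DESC
"""

rows = conn.execute(query).fetchall()
for cust_id, total in rows:
    print(cust_id, total)
```

In Hive the same statement would be compiled into one or more MapReduce jobs over files in HDFS, whereas here SQLite evaluates it directly; only the query syntax carries over.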