Performing Data Science with HBase
In our Hadoop World 2012 talk "Performing Data Science with HBase", Aaron Kimball and Kiyan Ahmadizadeh demonstrated publicly for the first time our new WibiData Language (or "WDL" for short) -- a concise, powerful, and interactive tool for engineers and data scientists to explore data and write ad-hoc queries against datasets stored in WibiData.

  1. Performing Data Science with HBase. Aaron Kimball, CTO; Kiyan Ahmadizadeh, MTS
  2. MapReduce and log files: Log files -> Batch analysis -> Result data set
  3. The way we build apps is changing
  4. HBase plays a big role
     • High performance random access
     • Flexible schema & sparse storage
     • Natural mechanism for time series data … organized by user
  5. Batch machine learning
  6. WibiData: architecture
  7. Data science lays the groundwork
     • Feature selection requires insight
     • Data lies in HBase, not log files
     • MapReduce is too cumbersome for exploratory analytics
  8. Data science lays the groundwork
     • Feature selection requires insight
     • Data lies in HBase, not log files
     • MapReduce is too cumbersome for exploratory analytics
     • This talk: How do we explore data in HBase?
  9. Why not Hive?
     • Need to manually sync column schemas
     • No complex type support for HBase + Hive
       – Our use of Avro facilitates complex record types
     • No support for time series events in columns
  10. Why not Pig?
      • Tuples handle sparse data poorly
      • No support for time series events in columns
      • UDFs and main pipeline are written in different languages (Java, Pig Latin)
        – (True of Hive too)
  11. Our analysis needs
      • Read from HBase
      • Express complex concepts
      • Support deep MapReduce pipelines
      • Be written in concise code
      • Be accessible to data scientists more comfortable with R & Python than Java
  12. Our analysis needs
      • Concise: We use Scala
      • Powerful: We use Apache Crunch
      • Interactive: We built a shell

        wibi>
  13. WDL: The WibiData Language

        wibi> 2 + 2
        res0: Int = 4

        wibi> :tables
        Table     Description
        ========  ===========
        page      Wiki page info
        user      per-user stats
  14. Outline
      • Analyzing Wikipedia
      • Introducing Scala
      • An Overview of Crunch
      • Extending Crunch to HBase + WibiData
      • Demo!
  15. Analyzing Wikipedia
      • All revisions of all English pages
      • Simulates a real system that could be built on top of WibiData
      • Allows us to practice real analysis at scale
  16. Per-user information
      • Rows keyed by Wikipedia user id or IP address
      • Statistics for several metrics on all edits made by each user
  17. Introducing Scala
      • Scala language allows declarative statements
      • Easier to express transformations over your data in an intuitive way
      • Integrates with Java and runs on the JVM
      • Supports interactive evaluation
  18. Example: Iterating over lists

        def printChar(ch: Char): Unit = {
          println(ch)
        }

        val lst = List('a', 'b', 'c')
        lst.foreach(printChar)
  19. … with an anonymous function

        val lst = List('a', 'b', 'c')
        lst.foreach( ch => println(ch) )

      • Anonymous function can be specified as an argument to the foreach() method of a list.
      • Lists, sets, etc. are immutable by default
  20. Example: Transforming a list

        val lst = List(1, 4, 7)
        val doubled = lst.map(x => x * 2)

      • map() applies a function to each element, yielding a new list. (doubled is the list [2, 8, 14])
  21. Example: Filtering
      • Apply a boolean function to each element of a list, keep the ones that return true:

        val lst = List(1, 3, 5, 6, 9)
        val threes = lst.filter(x => x % 3 == 0)
        // 'threes' is the list [3, 6, 9]
  22. Example: Aggregation

        val lst = List(1, 2, 12, 5)
        lst.reduceLeft( (sum, x) => sum + x )
        // Evaluates to 20.

      • reduceLeft() aggregates elements left-to-right, in this case by keeping a running sum.
  23. Crunch: MapReduce pipelines

        def runWordCount(input: String, output: String) = {
          val wordCounts = read(From.textFile(input))
            .flatMap(line => line.toLowerCase.split("""\s+"""))
            .filter(word => !word.isEmpty())
            .count
          wordCounts.write(To.textFile(output))
        }
  24. PCollections: Crunch data sets
      • Represent a parallel record-oriented data set
      • Items in PCollections can be lines of text, tuples, or complex data structures
      • Crunch functions (like flatMap() and filter()) do work on partitions of PCollections in parallel.
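The PCollection API deliberately mirrors Scala's own collection methods. As a rough local analogy (a plain Scala List standing in for a distributed PCollection; this is not the real Crunch API), the word-count pipeline from slide 23 looks like:

```scala
// Plain Scala List standing in for a distributed PCollection.
val lines = List("the quick brown fox", "the lazy dog")

val wordCounts = lines
  .flatMap(line => line.toLowerCase.split("""\s+"""))  // one record per word
  .filter(word => !word.isEmpty)                       // drop empty tokens
  .groupBy(identity)                                   // like PCollection .count
  .map { case (word, occurrences) => (word, occurrences.size) }
// wordCounts("the") == 2
```

The difference is where the work happens: Crunch compiles the same chain of transformations into MapReduce jobs over partitions, while the List version runs in one JVM.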
  25. PCollections… of WibiRows
      • WibiRow: Represents a row in a Wibi table
      • Enables access to sparse columns
      • … as a value: row(someColumn): Value
      • … as a timeline of values to iterate/aggregate: row.timeline(someColumn): Timeline[Value]
  26. Introducing: Kiyan Ahmadizadeh
  27. Demo!
  28. Demo: Visualizing Distributions
      • Suppose you have a metric taken on some population of users.
      • Want to visualize what the distribution of the metric among the population looks like.
        – Could inform next analysis steps, feature selection for models, etc.
      • Histograms can give insight on the shape of a distribution.
        – Choose a set of bins for the metric.
        – Count the number of population members whose metric falls into each bin.
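The two-step histogram recipe above (choose bins, then count members per bin) can be sketched with ordinary Scala collections; the bin edges and metric values below are made up for illustration:

```scala
// Choose bin lower edges, map each value to its bin, count per bin.
val bins = Range(0, 100, 10)   // lower edges 0, 10, ..., 90

def binOf(value: Double): Int =
  bins.reduceLeft((choice, bin) => if (value < bin) choice else bin)

val metrics = List(3.0, 12.5, 14.0, 47.9)  // hypothetical metric values
val histogram = metrics.groupBy(binOf).map { case (bin, vs) => (bin, vs.size) }
// histogram(10) == 2: both 12.5 and 14.0 fall in the [10, 20) bin
```

The demo that follows does exactly this, except the values come out of an HBase-backed user table and the grouping runs as a MapReduce job.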
  29. Demo: Wikimedia Dataset
      • We have a user table containing the average delta for all edits made by a user to pages.
      • Edit Delta: The number of characters added or deleted by an edit to a page.
      • Want to visualize the distribution of average deltas among users.
  30. Demo!
  31. Code: Accessing Data

        val stats =
          ColumnFamily[Stats]("edit_metric_stats")
        val userTable = read(
          From.wibi(WibiTableAddress.of("user"), stats))
  32. Code: Accessing Data

        val stats =
          ColumnFamily[Stats]("edit_metric_stats")
        val userTable = read(
          From.wibi(WibiTableAddress.of("user"), stats))

      stats will act as a handle for accessing the column family.
  33. Code: Accessing Data

        val stats =
          ColumnFamily[Stats]("edit_metric_stats")
        val userTable = read(
          From.wibi(WibiTableAddress.of("user"), stats))

      The type annotation tells WDL what kind of data to read out of the family.
  34. Code: Accessing Data

        val stats =
          ColumnFamily[Stats]("edit_metric_stats")
        val userTable = read(
          From.wibi(WibiTableAddress.of("user"), stats))

      userTable is a PCollection[WibiRow] obtained by reading the column family "edit_metric_stats" from the Wibi table "user".
  35. Code: UDFs

        def getBin(bins: Range, value: Double): Int = {
          bins.reduceLeft( (choice, bin) =>
            if (value < bin) choice else bin )
        }

        def inRange(bins: Range, value: Double): Boolean =
          bins.start <= value && value <= bins.end
  36. Code: UDFs

        def getBin(bins: Range, value: Double): Int = {
          bins.reduceLeft( (choice, bin) =>
            if (value < bin) choice else bin )
        }

        def inRange(bins: Range, value: Double): Boolean =
          bins.start <= value && value <= bins.end

      Everyday Scala function declarations!
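A quick sanity check of these UDFs on a sample bin range (the definitions are repeated so the snippet stands alone; the bin range and probe values are illustrative):

```scala
def getBin(bins: Range, value: Double): Int =
  bins.reduceLeft((choice, bin) => if (value < bin) choice else bin)

def inRange(bins: Range, value: Double): Boolean =
  bins.start <= value && value <= bins.end

val bins = Range(0, 100, 10)  // lower edges 0, 10, ..., 90
getBin(bins, 35.0)            // 30: the largest bin edge not exceeding 35.0
inRange(bins, 35.0)           // true:  0 <= 35.0 <= 100
inRange(bins, 150.0)          // false: outside the range
```

Note that getBin returns a bin's lower edge, so out-of-range values should be screened with inRange first.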
  37. Code: Filtering

        val filtered = userTable.filter { row =>
          // Keep editors who have edit_metric_stats:delta defined
          !row(stats).isEmpty && row(stats).get.contains("delta")
        }
  38. Code: Filtering

        val filtered = userTable.filter { row =>
          // Keep editors who have edit_metric_stats:delta defined
          !row(stats).isEmpty && row(stats).get.contains("delta")
        }

      Boolean predicate on elements in the PCollection.
  39. Code: Filtering

        val filtered = userTable.filter { row =>
          // Keep editors who have edit_metric_stats:delta defined
          !row(stats).isEmpty && row(stats).get.contains("delta")
        }

      filtered is a PCollection of rows that have the column edit_metric_stats:delta.
  40. Code: Filtering

        val filtered = userTable.filter { row =>
          // Keep editors who have edit_metric_stats:delta defined
          !row(stats).isEmpty && row(stats).get.contains("delta")
        }

      Use the stats variable we declared earlier to access the column family:
        val stats = ColumnFamily[Stats]("edit_metric_stats")
  41. Code: Binning

        val binCounts = filtered.map { row =>
          // Bucket mean deltas for histogram
          getBin(bins, abs(row(stats).get("delta").getMean))
        }.count()
        binCounts.write(To.textFile("output_dir"))
  42. Code: Binning

        val binCounts = filtered.map { row =>
          // Bucket mean deltas for histogram
          getBin(bins, abs(row(stats).get("delta").getMean))
        }.count()
        binCounts.write(To.textFile("output_dir"))

      Map each editor to the bin their mean delta falls into.
  43. Code: Binning

        val binCounts = filtered.map { row =>
          // Bucket mean deltas for histogram
          getBin(bins, abs(row(stats).get("delta").getMean))
        }.count()
        binCounts.write(To.textFile("output_dir"))

      Count how many times each bin occurs in the resulting collection.
  44. Code: Binning

        val binCounts = filtered.map { row =>
          // Bucket mean deltas for histogram
          getBin(bins, abs(row(stats).get("delta").getMean))
        }.count()
        binCounts.write(To.textFile("output_dir"))

      binCounts contains the number of editors that fall in each bin.
  45. Code: Binning

        val binCounts = filtered.map { row =>
          // Bucket mean deltas for histogram
          getBin(bins, abs(row(stats).get("delta").getMean))
        }.count()
        binCounts.write(To.textFile("output_dir"))

      Writes the result to HDFS.
  46. Code: Visualization

        Histogram.plot(binCounts,
          bins,
          "Histogram of Editors by Mean Delta",
          "Mean Delta",
          "Number of Editors",
          "delta_mean_hist.html")
  47. Analysis Results: 1% of Data
  48. Analysis Results: Full Data Set
  49. Conclusions: [three product logos] = Scalable analysis of sparse data
  50. Aaron Kimball – aaron@wibidata.com; Kiyan Ahmadizadeh – kiyan@wibidata.com