Huhdoop?: Uncertain Data Management on Non-Relational Database Systems
Upcoming SlideShare
Loading in...5
×
 

Huhdoop?: Uncertain Data Management on Non-Relational Database Systems

on

  • 752 views

A year of research into uncertain data management, summed up in memes.

A year of research into uncertain data management, summed up in memes.

My apologies in advance.

Statistics

Views

Total Views
752
Views on SlideShare
310
Embed Views
442

Actions

Likes
0
Downloads
0
Comments
0

4 Embeds 442

http://toromon.com 380
http://localhost 59
https://twitter.com 2
http://www.slideee.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

CC Attribution License

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

Huhdoop?: Uncertain Data Management on Non-Relational Database Systems Presentation Transcript

  • 1. HUHDOOP? U N C E R TA I N D A TA M A N A G E M E N T O N N O N - R E L A T I O N A L D A TA B A S E S Y S T E M S with memes
  • 2. U N C E R TA I N T Y Y U NO KNOW CERTAIN DATA POINT?! Uncertainty is inherent and prevalent
  • 3. U N C E R TA I N T Y IN DBMSES An active area of research Largely not discussed, IRL Mostly focused on the relational model (and XML)
  • 4. HADOOP B A T S H I T C R A Z Y, BUT IN A GOOD WAY
  • 5. D ATA B A S E S ON HADOOP Still need fast random access Don’t actually want to crunch files all the time
  • 6. HBASE Column-family database Part of the stack Dynamic-ish schemas
  • 7. MODEL OF U N C E R TA I N T Y
  • 8. 1 - D S E N S O R U N C E R TA I N T Y Probability Density Function U N C E R TA I N I N T E R V A L Lower Bound Upper Bound
  • 9. SIMPLE SENSOR U N C E R TA I N T Y M O D E L SENSORS FIXED ROW KEY U N C E R TA I N LOWER UPPER PDF
  • 10. U N C E R TA I N QUERIES
  • 11. DIMENSIONS VA L U E - B A S E D INDEPENDENT DEPENDENT E N T I T Y- B A S E D VA L U E S I N G L E QUERY ENTITY RANGE QUERY VA L U E S U M QUERY ENTITY MINIMUM QUERY
  • 12. VA L U E S I N G L E QUERY LIKE SPELLING YOUR NAME RIGHT ON THE SATS
  • 13. VA L U E S I N G L E QUERY Just grab a single record In HBase (shell): get 'Sensors', ‘1','Uncertain' Or in HiveQL: SELECT Lower, Upperper, PDF FROM hive_sensors WHERE id=1;
  • 14. VA L U E S U M QUERY O N LY H A R D IF YOU CAN’T ADD
  • 15. VA L U E S U M QUERY Simple In HiveQL: SELECT SUM(Upperper), SUM(Lower) FROM hive_sensors; Scalable! But…
  • 16. VSUMQ PDFS Single threaded Java app took 4 hours 23 minutes over only 1,000 records! 10,000 records proved impossible
  • 17. VSUMQ S T R AT E G I E S Just calculate regularly Cache it in Hive Reduces latency from 1048 seconds to 8 seconds Data staleness likely irrelevant for an aggregate of uncertain records
  • 18. ENTITY RANGE QUERY 4 TIMES THE WORK SAME NUMBER OF CREDIT HOURS
  • 19. ENTITY RANGE QUERY CLASS 1 CLASS 2 CLASS 3 CLASS 4 Lower Bound Upper Bound
  • 20. ERQ IN HIVEQL Class 1 SELECT Sensor_id, (Upper-10)/(Upper-Lower) AS probability FROM hive_sensors WHERE Upper>=10 AND Upper<=20 and Lower <=10; Class 2 SELECT Sensor_id, 1 AS probability FROM hive_sensors WHERE Upper<=20 and Lower >=10; Class 3 SELECT Sensor_id, (20-Lower)/(Upper-Lower) AS probability FROM hive_sensors WHERE Lower>=10 AND Upper>=20 and Lower <=20; Class 4 SELECT Sensor_id, (20-10)/(Upper-Lower) AS probability FROM hive_sensors WHERE Lower<=10 AND Upper>=20;
  • 21. A G G R E G AT E E R Q SELECT Sensor_id, (Upper-10)/(Upper-Lower) AS probability FROM hive_sensors WHERE Upper>=10 AND Upper<=20 and Lower <=10 UNION ALL SELECT Sensor_id, 1 AS probability FROM hive_sensors WHERE Upper<=20 and Lower >=10 UNION ALL SELECT Sensor_id, (20-Lower)/(Upper-Lower) AS probability FROM hive_sensors WHERE Lower>=10 AND Upper>=20 and Lower <=20 UNION ALL SELECT Sensor_id, (20-10)/(Upper-Lower) AS probability FROM hive_sensors WHERE Lower<=10 AND Upper>=20;
  • 22. SIMPLIFIED ERQ SELECT Sensor_id FROM hive_sensors WHERE (Upper>=10 AND Upper<=20 AND Lower<=10) OR (Upper<=20 and Lower>=10) OR (Lower>=10 AND Upper>=20 and Lower<=20) OR (Lower<=10 AND Upper>=20); ! Reduces to: SELECT * FROM hive_sensors WHERE Upper>=10 AND Lower<=20; * Just the intervals
  • 23. ENTITY RANGE QUERY O P T I M I Z AT I O N S
  • 24. ARBITRARY ROW KEYS SENSORS FIXED U N C E R TA I N ROW KEY LOWER UPPER PDF 1234 42 63 UNIFORM
  • 25. NON-ARBITRARY ROW KEYS SENSORS FIXED U N C E R TA I N ROW KEY LOWER UPPER PDF 42631234 42 63 UNIFORM
  • 26. PERFORMANCE
  • 27. D ATA I N C O L U M N F A M I L I E S SENSORS FIXED U N C E R TA I N _ L O W E R U N C E R TA I N _ U P P E R U N C E R TA I N ROW KEY LOWER LOWER_40 UPPER UPPER_60 PDF 1234 42 1 63 1 UNIFORM
  • 28. D ATA I N C O L U M N FA M I L I E S Have to use column-families, not just columns Does handle 2-dimensional uncertainty Bloom filters obviously help Query syntax gets complicated
  • 29. ENTITY MINIMUM QUERY LIKE THE KING OF THE DOWNVOTED
  • 30. EMINQ HIVE + JYTHON I M P L E M E N TAT I O N ... r1 = statement.executeQuery( "SELECT MIN(Upper) FROM hive_sensors;") result = statement.executeQuery( "SELECT * FROM hive_sensors WHERE Lower < {0};”.format(r1)) ...
  • 31. E M I N Q P I G I M P L E M E N TAT I O N test_sensors = load 'hbase://u_1' using org.apache.pig.backend.hadoop.hbase.HBaseStorage( ‘Fixed:Sensor_id, Uncertain:Upper, Uncertain:Lower, Uncertain:PDF', '-loadKey true') as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray); ! grouped = GROUP test_sensors ALL; minup = FOREACH grouped GENERATE MIN(test_sensors.up_val); ! inrange = FILTER test_sensors BY (down_val < minup.$0); dump inrange;
  • 32. EMINQ PERFORMANCE
  • 33. CASSANDRA NOT HADOOP JUST USEFUL
  • 34. SECONDARY INDEXES ON CASSANDRA CREATE TABLE sensors ( Sensor_id int, Lower float, Upper float, PDF text, PRIMARY KEY (Sensor_id) ); ! CREATE INDEX sensors_down ON sensors (Lower); ! CREATE INDEX sensors_up ON sensors (Upper);
  • 35. OPEN QUESTIONS I SPENT A YEAR OF MY LIFE ON THIS STUFF: AMA!
  • 36. R E P O S T: E M I N Q P I G I M P L E M E N TAT I O N test_sensors = load 'hbase://u_1' using org.apache.pig.backend.hadoop.hbase.HBaseStorage( ‘Fixed:Sensor_id, Uncertain:Upper, Uncertain:Lower, Uncertain:PDF', '-loadKey true') as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray); ! grouped = GROUP test_sensors ALL; minup = FOREACH grouped GENERATE MIN(test_sensors.up_val); ! inrange = FILTER test_sensors BY (down_val < minup.$0); dump inrange;
  • 37. EMINQ FILE-BASED REWRITE Using HBase test_sensors = load 'hbase://u_1' using org.apache.pig.backend.hadoop.hbase.HBaseSto rage(‘Fixed:Sensor_id, Uncertain:Upper, Uncertain:Lower, Uncertain:PDF', '-loadKey true') as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray); Using Files test_sensors = load 'uncertain_data_file' as (ID:bytearray, Sensor_id:int, up_val:float, down_val:float, pdf:chararray);
  • 38. E M I N Q & A L L F U L L TA B L E Q U E R I E S
  • 39. CREDITS University of Hong Kong Computer Science Department ! Reynold Cheng-research supervision, academic instruction Ben Kao-research evaluation ! Liu Lu-research, software implementation Wang Zuyao-research, software implementation
  • 40. ME @jeffksmithjr toromon.com