Data mining & column stores




a class presentation




Usage Rights

© All Rights Reserved

    Data mining & column stores: Presentation Transcript

    • Data Mining & Column Stores (Aung Thu Rha Hein)
    • Why use Data Mining?
        • Explosive growth of available data
        • Major sources:
            • Business: Web, e-commerce, transactions
            • Science: remote sensing, bioinformatics, …
            • Society: news, gadgets, social media
        • Too much data but too little information
        • Data mining extracts useful information from the data and helps interpret it
        • It can automate the process of finding relationships and patterns in raw data
    • What is Data Mining?
        • Also known as Knowledge Discovery in Databases, or "KDD"
        • The process of extracting hidden predictive information from large data sets
        • Converts information into knowledge that can be used to predict future trends and support decisions
        • Examples:
            • Consumer buying behavior in retail supermarket sales
            • Google Instant, YouTube Instant
            • Blogs and news: Technorati, News360, and so on
            • Social mining: Livehoods, which finds patterns and behaviors in Foursquare check-in data
    • Data Mining Process
        • The Cross-Industry Standard Process for Data Mining (CRISP-DM):
            1. Business understanding
            2. Data understanding
            3. Data preparation
            4. Modeling
            5. Evaluation
            6. Deployment
    • Techniques
        I. Association rules: also known as market basket analysis; discover interesting associations between attributes
        II. Classification: a technique based on machine learning; uses mathematical techniques such as decision trees, linear programming, neural networks, and statistics
        III. Clustering: forms meaningful or useful clusters of objects that have similar characteristics
        IV. Prediction: discovers relationships among independent variables and between dependent and independent variables
        V. Sequential patterns: discover similar patterns in transaction data over a business period
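
To make the association-rule idea concrete, here is a minimal Python sketch of the two standard measures, support and confidence, applied to pairs of items. The transactions, the thresholds, and the function names (`support`, `confidence`, `pair_rules`) are hypothetical and chosen for illustration; they are not taken from the slides, and real tools use more efficient algorithms such as Apriori.

```python
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data, not from the slides).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Brute-force search for rules x -> y between single items."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        s = support({a, b}, transactions)
        if s < min_support:
            continue  # the pair is too rare to be interesting
        for x, y in ((a, b), (b, a)):
            c = confidence({x}, {y}, transactions)
            if c >= min_confidence:
                rules.append((x, y, s, c))
    return rules


for x, y, s, c in pair_rules(transactions):
    print(f"{x} -> {y}: support={s:.2f}, confidence={c:.2f}")
```

With this toy data only the bread/milk pair passes both thresholds, in both directions. Raising or lowering `min_support` and `min_confidence` changes how many rules survive.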
    • Tools
        • There are three categories of tools for data mining:
            i. Traditional data mining tools
            ii. Dashboards
            iii. Text-mining tools
        • Some data mining tools:
            • R: r-project.org
            • Datameer Analytics Solution: datameer.com
            • SAS Analytics: sas.com
            • Google Chart API: code.google.com/apis/chart
    • Column Stores
        • A column store stores data tables as columns of data rather than as rows
        • Column-oriented DBMSs:
            • Bigtable, HBase, Hypertable, Cassandra (NoSQL)
            • Sybase IQ, MonetDB, C-Store, Vertica, VectorWise, Infobright (relational)
        • Used in systems such as data warehouses and data mining platforms
        • Example:

            Emp_ID  Emp_Name  Emp_Dept   Emp_Salary
            1       Smith     IT         40000
            2       Adam      Sales      35000
            3       Jones     Marketing  45000

        • The database must serialize its two-dimensional table into a one-dimensional sequence for the operating system to write to storage
        • A column store serializes all values of one column together, then the next column:
            1, 2, 3; Smith, Adam, Jones; IT, Sales, Marketing; 40000, 35000, 45000
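
The difference between the two serialization orders can be sketched in a few lines of Python using the employee table from the slide. The helper names (`row_layout`, `column_layout`, `to_columns`) are hypothetical, and in-memory lists stand in for the on-disk byte layout; this is an illustration of the idea, not how any particular DBMS stores data.

```python
# Employee table from the slide, held as a list of rows.
rows = [
    (1, "Smith", "IT", 40000),
    (2, "Adam", "Sales", 35000),
    (3, "Jones", "Marketing", 45000),
]
column_names = ("Emp_ID", "Emp_Name", "Emp_Dept", "Emp_Salary")


def row_layout(rows):
    """Row store order: all values of one record are adjacent."""
    return [value for row in rows for value in row]


def column_layout(rows):
    """Column store order: all values of one attribute are adjacent."""
    return [value for column in zip(*rows) for value in column]


def to_columns(rows, names):
    """Pivot the rows into one list per column, keyed by column name."""
    return {name: list(col) for name, col in zip(names, zip(*rows))}


print(row_layout(rows))     # record by record
print(column_layout(rows))  # attribute by attribute

# An aggregate over one attribute only needs to touch that column.
columns = to_columns(rows, column_names)
avg_salary = sum(columns["Emp_Salary"]) / len(columns["Emp_Salary"])
print(avg_salary)
```

The last three lines show why analytical queries favor the column layout: computing the average salary reads one list and never touches names or departments.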
    • Advantages and Disadvantages of Column Stores
        • Advantages
            • Only relevant data needs to be read (improved bandwidth utilization)
            • Improved cache locality
                • No need to transmit surrounding attributes
            • Compression efficiency: columns compress better than rows
                • Rows contain values from different domains, while a column holds values from a single domain
                • Row-store compression ratio: 1:3
                • Column-store compression ratio: 1:10
        • Disadvantages
            • Increased disk seek time
            • Increased cost of inserts
            • Increased tuple reconstruction costs
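
One way to see why a single-domain column compresses better than interleaved rows is run-length encoding (RLE). The sketch below is an illustration only: the data is made up, RLE is just one of several schemes a column store might use, and the run counts it prints say nothing about the 1:3 and 1:10 ratios quoted on the slide.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


# Hypothetical data: two columns of an eight-row table.
depts = ["IT", "IT", "IT", "Sales", "Sales", "Marketing", "Marketing", "Marketing"]
salaries = [40000, 40000, 40000, 35000, 35000, 45000, 45000, 45000]

# Column layout: each column is stored, and encoded, on its own.
dept_runs = run_length_encode(depts)
salary_runs = run_length_encode(salaries)

# Row layout: values from different domains alternate, so no two
# neighbours are equal and every value becomes its own run.
interleaved = [value for pair in zip(depts, salaries) for value in pair]
row_runs = run_length_encode(interleaved)

print(len(dept_runs) + len(salary_runs), "runs in column layout")
print(len(row_runs), "runs in row layout")
```

The column layout needs far fewer runs here because equal values sit next to each other; the same values interleaved by row give the encoder nothing to collapse.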
    • Case Study: Bazaarvoice
        • Had difficulty aggregating large amounts of data on the fly, in real time, for its analytics product
        • Common query pattern: a small number of columns, with most values being aggregates such as counts, sums, and averages
        • Used Infobright, an open-source database built on MySQL
        • Tested with a data set of 100 million records in the main fact table
        • Average execution time for analytical queries was 20x faster than MySQL's
    • Case Study: Bazaarvoice (cont.)
        • Disk footprint was over 10x smaller than MySQL's due to data compression
        • Why?
            • Column storage means less disk I/O
            • The "knowledge grid": aggregate data that Infobright calculates during data loading
            • E.g., it pre-calculates the min, max, and avg values for each column in a data pack
        • Limitations of Infobright
            • Does not support DML
            • The only way to add data is to bulk load it with the "LOAD DATA INFILE …" command
            • No way to update or delete existing data without reloading the table
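
The benefit of pre-calculated per-pack statistics can be sketched as follows. This is a simplified illustration of the general idea, not Infobright's actual implementation: the pack size, the dictionary layout, and the function names (`build_pack_stats`, `count_greater_than`, `average`) are all assumptions made for the example.

```python
def build_pack_stats(values, pack_size):
    """Split a column into packs and pre-compute summary statistics at load time."""
    packs = []
    for start in range(0, len(values), pack_size):
        pack = values[start:start + pack_size]
        packs.append({
            "values": pack,
            "min": min(pack),
            "max": max(pack),
            "sum": sum(pack),
            "count": len(pack),
        })
    return packs


def count_greater_than(packs, threshold):
    """Count values > threshold, opening a pack only when its stats are inconclusive.

    Returns (matching_count, number_of_packs_actually_scanned).
    """
    total = 0
    opened = 0
    for pack in packs:
        if pack["max"] <= threshold:
            continue  # no value in this pack can qualify
        if pack["min"] > threshold:
            total += pack["count"]  # every value qualifies; no scan needed
            continue
        opened += 1  # stats cannot decide, so scan the pack
        total += sum(v > threshold for v in pack["values"])
    return total, opened


def average(packs):
    """Answer AVG from the stored sums and counts without reading any values."""
    return sum(p["sum"] for p in packs) / sum(p["count"] for p in packs)


# Hypothetical column of 12 values, split into packs of 4.
packs = build_pack_stats(list(range(1, 13)), pack_size=4)
matches, scanned = count_greater_than(packs, threshold=6)
print(matches, "matches;", scanned, "of", len(packs), "packs scanned")
print("average:", average(packs))
```

In this toy run only one of the three packs has to be scanned for the range query, and the average is answered entirely from the stored statistics, which is the kind of shortcut that makes aggregate-heavy queries cheap.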
    • References
        • Data Mining
            • http://en.wikipedia.org/wiki/Data_mining
            • http://www.inc.com/magazine/20101001/4-essential-data-mining-tools.html
            • http://www.dataminingtechniques.net/
            • http://www.unc.edu/~xluan/258/datamining.html
            • http://www.data-miners.com/
            • http://www.exforsys.com/tutorials/data-mining/how-data-mining-is-evolving.html
            • http://livehoods.org/
        • Column Stores
            • http://en.wikipedia.org/wiki/Column_store
            • http://developer.bazaarvoice.com/why-columns-are-cool
            • http://www.calpont.com/doc/Calpont_Whitepaper-Best-Practices-in_the_Use_of_Columnar_Databases.pdf