Big Data Overview 2013-2014


At the Technology Trends seminar, held with lecturers from the HCMC University of Polytechnics, KMS Technology's CTO presented on Big Data, Cloud Computing, Mobile, Social Media and In-memory Computing.


Big Data Overview 2013-2014: Presentation Transcript

  • BIG DATA
    Kaushal Amin, Chief Technology Officer
    KMS Technology – Atlanta, GA, USA
  • AGENDA
    • What is Big Data
    • Why not RDBMS
    • NoSQL
    • NewSQL
    • Performance Comparison
    • Case Studies
  • WHAT IS BIG DATA
  • WHAT IS BIG DATA?
    • “Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population.” - Teradata Magazine article, 2011
    • “Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze.” - The McKinsey Global Institute, 2011
    • Volume and Variety of Data that is difficult to manage using traditional data management technology
  • WHAT IS GENERATING BIG DATA?
    • Homeland Security
    • Real Time Search
    • Social
    • eCommerce
    • User Tracking & Engagement
    • Financial Services
  • HOW MUCH DATA?
    • 7 billion people
    • Google processes 100 PB/day; 3 million servers
    • Facebook has 300 PB + 500 TB/day; 35% of world’s photos
    • YouTube 1000 PB video storage; 4 billion views/day
    • Twitter processes 124 billion tweets/year
    • SMS messages – 6.1T per year
    • US Cell Calls – 2.2T minutes per year
    • US Credit cards - 1.4B Cards; 20B transactions/year
  • LOWER COST OF STORAGE
    What can I buy for $100 (USD)? (not adjusted for inflation)
    • Memory Capacity = 128 GB by 2020 (x1420 in 20 years)
    • Disk Capacity = 10 TB by 2020 (x1000 in 20 years)
  • HOW IS BIG DATA DIFFERENT?
    • Automatically generated by a machine
        – (e.g. Sensor embedded in an engine)
    • Typically an entirely new source of data
        – (e.g. Use of the internet)
    • Not designed to be friendly
        – (e.g. Text streams)
    • May not have much value
        – Need to focus on the important part
  • WHO UTILIZES IT?
    • Companies and organizations that can leverage large-scale consumer-produced data
        – Marketing
        – Consumer Markets (retail, airlines, hotels, Amazon, Netflix)
        – Social Media (Facebook, Twitter, YouTube, LinkedIn)
        – Search Providers (Google, Yahoo, Microsoft)
        – People Data Aggregators (LexisNexis, Equifax, Acxiom)
    • Other enterprises are slowly getting into it
        – Healthcare
        – Financial Institutions
  • WHY NOT RDBMS?
  • TYPE OF DATA
    • Structured Data (Transactions)
    • Text Data (Web Content)
    • Semi-structured Data (XML)
    • Unstructured Data
        – Social Network, SMS, Audio, Video
    • Streaming Data
        – You can only scan the data once as it travels over the network
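The "scan once" constraint on streaming data can be illustrated with a single-pass statistic. The following is a minimal sketch, not something from the slides: it uses Welford's method to keep a running count, mean, and variance without storing the stream. The `RunningStats` class and the `packet_sizes` generator are hypothetical names chosen for illustration.

```python
from typing import Iterator


class RunningStats:
    """Single-pass count, mean, and population variance (Welford's method)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, value: float) -> None:
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 until at least one value has been seen.
        return self._m2 / self.count if self.count else 0.0


def packet_sizes() -> Iterator[int]:
    """Hypothetical stream: a generator can be consumed only once."""
    yield from (512, 1500, 64, 1500, 980)


stats = RunningStats()
for size in packet_sizes():
    stats.update(size)  # each value is seen exactly once and never stored

print(stats.count, stats.mean, stats.variance)
```

Because only three numbers are kept in memory, the same loop works for a stream of any length.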
  • WHAT TO DO WITH THESE DATA?
    • Aggregation and Statistics
        – Data warehouse and OLAP
    • Indexing, Searching, and Querying
        – Keyword based search
        – Pattern matching (XML/RDF)
    • Knowledge discovery
        – Data Mining
        – Statistical Modeling
  • RDBMS LIMITATIONS
    • Very difficult to scale horizontally (more boxes); the usual way to scale is vertically, by using a bigger box
        – Physically limited by CPUs, disk storage, and memory
        – Large servers are too expensive and still can’t scale
    • Requires structure of tables with rows and columns
        – Does not deal well with unstructured data
    • Relationships have to be pre-defined through schema
        – Difficult to add newly discovered data quickly
  • NOSQL
  • NOSQL CHARACTERISTICS
    • Cheap, easy to implement (open source)
        – Cluster of cheap commodity servers with cheap storage
    • Data are replicated to multiple nodes (therefore identical and fault-tolerant) and can be partitioned
        – Down nodes can easily be replaced while cluster is operational
        – No single point of failure
    • Easy to distribute
    • Doesn't require a schema
    • Massive Scalability
    • Relaxed data consistency requirement (CAP) – less locking and resource contention
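The replication point above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not how any particular NoSQL product works: the `Cluster` class, its node names, and the "next node in the list" replica placement are all hypothetical. Each key is written to `replicas` nodes, so a read still succeeds while one of them is down.

```python
import hashlib


class Cluster:
    """Toy cluster that copies every key to several nodes (hypothetical design)."""

    def __init__(self, node_names, replicas=2):
        self.names = list(node_names)
        self.nodes = {name: {} for name in self.names}  # node name -> local data
        self.down = set()                               # nodes currently offline
        self.replicas = replicas

    def owners(self, key):
        # Pick a starting node from a stable hash, then take the next nodes in order.
        start = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16) % len(self.names)
        return [self.names[(start + i) % len(self.names)] for i in range(self.replicas)]

    def put(self, key, value):
        for name in self.owners(key):
            if name not in self.down:
                self.nodes[name][key] = value

    def get(self, key):
        # Any live replica can answer, so there is no single point of failure.
        for name in self.owners(key):
            if name not in self.down and key in self.nodes[name]:
                return self.nodes[name][key]
        raise KeyError(key)


cluster = Cluster(["node-a", "node-b", "node-c"], replicas=2)
cluster.put("user:1", "Ann")
cluster.down.add(cluster.owners("user:1")[0])  # simulate losing the first replica
print(cluster.get("user:1"))                   # still served by the second replica
```

A real system would also repair the failed replica when it returns and decide how to handle writes that reach only some replicas, which is where the relaxed consistency trade-off comes in.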
  • NOSQL – SEVERAL OPTIONS
    • Currently 150 implementations and growing (http://nosql-database.org/)
    • Multiple Types based on storage architecture
        – Key-Value
        – Document
        – Column Family
        – Graph
  • KEY-VALUE STORE
    • Values stored as Key-Value pairs in a hash map
    • Distributed across nodes based on key
    • Simple Operations: insert, fetch, update, and delete
    • Best for storing high-volume datasets with low complexity (simple data model)
    • Some of the market leaders:
        – Riak
        – Amazon Dynamo
        – Voldemort
  • KEY-VALUE STORE
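The four operations and the "distributed across nodes based on key" idea can be sketched as follows. This is a minimal in-memory illustration, assuming simple modulo hashing; the `KeyValueStore` class is hypothetical and does not mirror the API of Riak, Dynamo, or Voldemort.

```python
import zlib


class KeyValueStore:
    """Toy key-value store that partitions keys across in-memory 'nodes'."""

    def __init__(self, node_count=3):
        self.nodes = [{} for _ in range(node_count)]  # one hash map per node

    def _node_for(self, key):
        # A stable hash of the key decides which node holds it.
        return self.nodes[zlib.crc32(key.encode("utf-8")) % len(self.nodes)]

    def insert(self, key, value):
        node = self._node_for(key)
        if key in node:
            raise KeyError(f"{key!r} already exists")
        node[key] = value

    def fetch(self, key):
        return self._node_for(key)[key]

    def update(self, key, value):
        node = self._node_for(key)
        if key not in node:
            raise KeyError(key)
        node[key] = value

    def delete(self, key):
        del self._node_for(key)[key]


store = KeyValueStore(node_count=3)
store.insert("session:42", {"user": "ann"})
store.update("session:42", {"user": "ann", "cart": 2})
print(store.fetch("session:42"))
store.delete("session:42")
```

Every operation touches exactly one node and needs no knowledge of the value's structure, which is why this model suits high-volume, low-complexity data.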
  • COLUMN FAMILY STORE
    • Stores families of columns
    • Columns are stored as Key-Value pairs
    • A super column is like a catalogue or a collection of other columns
    • Columns within a family can be distributed across nodes
    • Supports semi-structured data with high scalability
    • Some of the market leaders:
        – HBase
        – Cassandra
  • COLUMN FAMILY STORE (HBASE)
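One way to picture the column family model is as nested maps: row key, then column family, then column name, then value. The sketch below assumes that simplified layout; the `ColumnFamilyTable` class and the sample family names are hypothetical and are not the HBase or Cassandra API.

```python
class ColumnFamilyTable:
    """Toy table: row key -> column family -> column name -> value."""

    def __init__(self, families):
        self.families = frozenset(families)  # families are fixed up front
        self.rows = {}

    def put(self, row_key, family, column, value):
        if family not in self.families:
            raise ValueError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {name: {} for name in self.families})
        row[family][column] = value  # columns themselves are free-form

    def get_family(self, row_key, family):
        # Returns a copy of all columns stored under one family for one row.
        return dict(self.rows.get(row_key, {}).get(family, {}))


users = ColumnFamilyTable(families=["info", "activity"])
users.put("user1", "info", "name", "Ann")
users.put("user1", "info", "email", "ann@example.com")
users.put("user2", "info", "name", "Bob")            # no email column for this row
users.put("user2", "activity", "last_login", "2013-05-01")
print(users.get_family("user1", "info"))
print(users.get_family("user2", "info"))
```

Rows share the same families but not the same columns, which is what makes the model a fit for semi-structured data.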
  • DOCUMENT STORE
    • Supports more complex data model than Key-Value
    • Collection of Documents – JSON, XML, other semi-structured formats
    • A document is a key-value collection
    • Multi-Index support
    • Best for storing complex data models, but less scalable
    • Some of the market leaders:
        – MongoDB
        – CouchDB
        – SimpleDB
  • DOCUMENT STORE
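The combination of free-form documents and multi-index support can be sketched with plain dictionaries. This is an illustrative toy, not the MongoDB or CouchDB API: the `Collection` class is hypothetical, and it assumes indexed field values are hashable.

```python
from collections import defaultdict


class Collection:
    """Toy document collection with one secondary index per chosen field."""

    def __init__(self, indexed_fields):
        self.documents = {}  # document id -> document (a dict)
        self.indexes = {field: defaultdict(set) for field in indexed_fields}

    def insert(self, doc_id, document):
        self.documents[doc_id] = document
        for field, index in self.indexes.items():
            if field in document:  # documents need not share the same fields
                index[document[field]].add(doc_id)

    def find(self, field, value):
        # Look up matching ids in the index instead of scanning every document.
        ids = self.indexes[field].get(value, set())
        return [self.documents[doc_id] for doc_id in sorted(ids)]


people = Collection(indexed_fields=["city", "role"])
people.insert(1, {"name": "Ann", "city": "Atlanta", "role": "CTO"})
people.insert(2, {"name": "Bob", "city": "Atlanta"})
people.insert(3, {"name": "Chi", "city": "HCMC", "role": "lecturer", "tags": ["db"]})
print([doc["name"] for doc in people.find("city", "Atlanta")])
```

Unlike the key-value case, the store looks inside each value, which is what allows queries on fields other than the primary key.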
  • GRAPH DATABASE
    • Social Graph with Relationship between Entities
    • Great for Social Networks
        – Facebook friends network
        – LinkedIn connections network
    • Some of the market leaders:
        – Neo4j
        – FlockDB
        – Pregel
  • GRAPH DATABASE - EXAMPLE
    • Nodes represent entities such as people, businesses, accounts, or any other item you might want to keep track of.
    • Properties are pertinent information that relate to nodes such as name, age, DOB, gender.
    • Edges are the lines that connect nodes to nodes or nodes to properties and they represent the relationship between the two.
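The node, property, and edge vocabulary above maps onto a small data structure. The sketch below is a simplification under stated assumptions: properties are stored directly on nodes rather than connected by edges, and the `Graph` class, the `FRIEND` label, and `friends_of_friends` are hypothetical names, not a Neo4j or FlockDB API.

```python
class Graph:
    """Toy property graph: nodes carry properties, edges carry a relationship label."""

    def __init__(self):
        self.nodes = {}     # node id -> properties dict
        self.edges = set()  # (source id, relationship, target id)

    def add_node(self, node_id, **properties):
        self.nodes[node_id] = properties

    def add_edge(self, source, relationship, target):
        self.edges.add((source, relationship, target))

    def add_friendship(self, a, b):
        # Friendship is mutual, so store the edge in both directions.
        self.add_edge(a, "FRIEND", b)
        self.add_edge(b, "FRIEND", a)

    def neighbors(self, node_id, relationship):
        return {t for s, r, t in self.edges if s == node_id and r == relationship}


def friends_of_friends(graph, person):
    """People two hops away who are not already direct friends."""
    direct = graph.neighbors(person, "FRIEND")
    indirect = set()
    for friend in direct:
        indirect |= graph.neighbors(friend, "FRIEND")
    return indirect - direct - {person}


g = Graph()
g.add_node("alice", name="Alice", age=30)
g.add_node("bob", name="Bob", age=41)
g.add_node("carol", name="Carol", age=27)
g.add_friendship("alice", "bob")
g.add_friendship("bob", "carol")
print(friends_of_friends(g, "alice"))
```

Traversals like this are the queries that become awkward multi-way joins in a relational schema, which is the case the slide makes for graph databases in social networks.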
  • NEWSQL
  • NEWSQL
    • The argument is that the relational model itself is not the cause of poor scalability; the limitations of its physical implementations are
    • Development of new relational database products and services designed to bring the benefits of the relational model to distributed architectures
    • Three Approaches:
        – Optimized MySQL storage engines (ScaleDB, MemSQL, Akiban)
        – New SQL databases (Clustrix, VoltDB, NuoDB)
        – Sharding middleware to split an RDBMS across nodes (ScaleBase, ScaleArc, dbShards)
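The third approach, sharding middleware, can be sketched with several independent relational databases behind one routing layer. This is an illustrative toy, assuming in-memory SQLite databases as stand-ins for separate database servers; the `ShardedUsers` class is hypothetical and does not reflect how ScaleBase, ScaleArc, or dbShards are implemented.

```python
import sqlite3
import zlib


class ShardedUsers:
    """Toy sharding layer: each shard is a separate relational database."""

    def __init__(self, shard_count=2):
        # In-memory SQLite databases stand in for separate RDBMS nodes.
        self.shards = [sqlite3.connect(":memory:") for _ in range(shard_count)]
        for conn in self.shards:
            conn.execute("CREATE TABLE users (id TEXT PRIMARY KEY, name TEXT)")

    def _shard(self, user_id):
        # The shard key (here the user id) decides which database holds the row.
        return self.shards[zlib.crc32(user_id.encode("utf-8")) % len(self.shards)]

    def insert(self, user_id, name):
        conn = self._shard(user_id)
        conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (user_id, name))
        conn.commit()

    def get(self, user_id):
        cursor = self._shard(user_id).execute(
            "SELECT name FROM users WHERE id = ?", (user_id,)
        )
        row = cursor.fetchone()
        return row[0] if row else None

    def count(self):
        # Queries without a shard key must fan out to every shard.
        return sum(
            conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
            for conn in self.shards
        )


db = ShardedUsers(shard_count=3)
for user_id, name in [("u1", "Ann"), ("u2", "Bob"), ("u3", "Chi"), ("u4", "Dee")]:
    db.insert(user_id, name)
print(db.get("u3"), db.count())
```

Each shard keeps full SQL and its own schema, so the application still sees a relational model while the data is spread across nodes.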
  • PERFORMANCE COMPARISON
  • SOURCE AND APPROACH
    • Independent testing done by Altoros Systems Inc.
    • More details at http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html?page=1
    • Using Amazon virtual machines to ensure verifiable results and research transparency (which also helped minimize errors due to hardware differences)
        – Riak, a key-value store
        – Cassandra, a column family store
        – HBase, a column family store
        – MongoDB, a document-oriented database
        – MySQL Cluster, a NewSQL
        – Sharded MySQL, a NewSQL
  • PERFORMANCE ON WRITE
  • PERFORMANCE ON READ
  • CASE STUDIES
  • EXAMPLE: HEALTHCARE
    A health care consultancy has made the data coming out of medical practices the focus of its thriving business. The company collects billing and diagnostic code data from 10,000 doctors on a daily, weekly and monthly basis to create a virtual clinical integration model. The consulting company analyzes the data to help the groups understand how well they are meeting the FTC guidelines for negotiating with health plans and whether they qualify for enhanced reimbursement based on offering a more cost-effective standard of care.
    It also sends them automated information to better take care of patients, like creating an automated outbound calling system for pediatric patients who weren’t up to date on their vaccinations.
  • EXAMPLE: RETAIL
    Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
  • EXAMPLE: UTILITY
    With a smart meter, a utility company goes from collecting one data point a month per customer (using a meter reader in a truck or car) to receiving 3,000 data points for each customer each month, while smart meters send usage information up to four times an hour.
    One small Midwestern utility is using smart meter data to structure conservation programs that analyze existing usage to forecast future use, price usage based on demand and share that information with customers who might decide to forestall doing that load of wash until they can pay for it at the nonpeak price.
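The "3,000 data points" figure follows from the reporting rate. As a quick check, assuming a 30-day month (the slide does not state the month length) and the maximum rate of four readings an hour:

```python
READINGS_PER_HOUR = 4   # "up to four times an hour" from the slide
HOURS_PER_DAY = 24
DAYS_PER_MONTH = 30     # assumption: a 30-day month

smart_meter_points = READINGS_PER_HOUR * HOURS_PER_DAY * DAYS_PER_MONTH
manual_points = 1       # one meter reading per customer per month

print(smart_meter_points)                   # readings per customer per month
print(smart_meter_points // manual_points)  # growth factor over manual reads
```

The result, 2,880 readings per customer per month, is consistent with the slide's rounded figure of 3,000.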
  • GROWTH FORECAST
  • Q&A
    © 2013 KMS Technology