Big data

Notes

  • I like good movies & books.
    So obviously I like the Sherlock Holmes books, movies and TV series, and also Star Trek (movies and TV series).
    BigData is elementary and clear for Star Trek’s android Data or for the super-detective Sherlock Holmes.
    Why? Because both fictional characters possess the power to process vast amounts of information and to visualize, analyze and join content really fast!
  • Let’s pass over this slide really fast…
  • What does Gartner have to say about BigData?
    Are you using Facebook?
    I do not believe that BigData is hype.
  • Google – 1M+ servers in several datacenters, storing part of the index in RAM
    Amazon (retail website, Amazon DynamoDB, EMR, Redshift), eBay
    Twitter, Facebook, LinkedIn, Foursquare
    VMware, Cloudera, IBM, HP, Oracle, SAP, EMC (Pivotal), MapR, Hortonworks, Microsoft
  • At Amazon I worked in particular on Analytics and Visualization.
    Visualization is in demand because it makes data analysis easier.
  • Unstructured data makes up about 80% of the data in the world (and the percentage is increasing).
  • Now that the previous slides have presented the BigData concept, let’s move on to Hadoop – one of the many frameworks widely used for processing it!
    Today Google is obviously much more advanced than it was back in 2004. Too bad that Google’s BigQuery service is very expensive – but it is near-realtime, something the other players on the market cannot offer yet!
  • A quick example of how MapReduce works; the next slide will show a more complex one.
    Goal: count the number of books in the library.
    Map: you count up the odd-numbered shelves, I count up the even-numbered shelves. (The more people we get, the faster this part goes.)
    Reduce: we all get together and add up our individual counts. (A minimal sketch of this idea in plain Java follows below.)
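
    To make the analogy concrete, here is a minimal sketch of the same divide-and-conquer counting in plain Java (no Hadoop involved); the shelf counts and the two-worker split are invented purely for illustration.

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    public class LibraryCount {
        public static void main(String[] args) throws Exception {
            // Each element is the number of books on one shelf (made-up data).
            List<Integer> shelves = List.of(12, 7, 30, 5, 18, 22, 9, 14);
            ExecutorService pool = Executors.newFixedThreadPool(2);

            // "Map": each worker counts its own partition of shelves.
            Future<Integer> odd  = pool.submit(() -> sumEverySecond(shelves, 1));
            Future<Integer> even = pool.submit(() -> sumEverySecond(shelves, 0));

            // "Reduce": combine the partial counts into the final total.
            System.out.println("Total books: " + (odd.get() + even.get()));
            pool.shutdown();
        }

        // Sums every second shelf, starting at the given offset (0 = even, 1 = odd).
        static int sumEverySecond(List<Integer> shelves, int offset) {
            int total = 0;
            for (int i = offset; i < shelves.size(); i += 2) total += shelves.get(i);
            return total;
        }
    }

    Adding more workers just means splitting the shelf list into more partitions; the reduce step stays a simple sum.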
  • The input file is split into blocks on HDFS.
    The colored spots represent types of rows.
    Mapper phase – each block is read by a Map, which produces KV pairs (color, number = 1).
    Combiner phase – an optional step, not shown in the diagram (its scope is to operate over the Mapper output and “reduce” its size, similarly to the Reducer phase).
    Shuffle / Sorting / Partitioning phase.
    Reducer phase – gets all the values for a specific key (that is the scope of the previous Shuffle / Sorting phase), with keys arriving in increasing order.
    The Reducer output is stored again on HDFS, in an output folder under some predefined Part-000N files (which can be used as inputs for a subsequent MapReduce job).
  • Coding example: this very simple example presents the Java code of the WordCount MR job.
    Usually a real MR job is composed of 3 files:
    Map class
    Reducer class
    Driver class (which configures and starts the job)
    A sketch of all three is shown below.
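
    The slide’s code itself is not reproduced in this transcript, so here is a minimal WordCount sketch against the standard Hadoop MapReduce API (org.apache.hadoop.mapreduce); the input and output paths are taken from the command line, and the class names are just the conventional ones.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map class: emits (word, 1) for every word in its input split.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            }
        }

        // Reducer class: sums all the 1s emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) sum += val.get();
                result.set(sum);
                context.write(key, result);
            }
        }

        // Driver class: configures and starts the job.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class); // the optional Combiner phase
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));   // input folder on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1])); // output folder (part files)
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

    Note how the same Reducer is also registered as the Combiner – the optional phase mentioned earlier – shrinking the Mapper output before the shuffle.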

Transcript

  • 1. Elementary, my clear …BigData
    “Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.” – Sherlock Holmes, The Adventure of the Copper Beeches
    Constantin Ciureanu, Software Architect (ciureanu@optymyze.com)
    Project “Hadoop Data Transformation”
  • 2. Small Data
    • If Excel cannot handle that amount of data, it doesn’t mean it’s BigData :)
    • “Small data” usually means:
      – Very clear & structured data
      – Easy-to-understand dataset content, easy to join with other datasets
    • Keeping data small means:
      – Storing just a limited amount of records / only the relevant columns
      – Processing / sampling / aggregating, then storing the results and dropping the rest (so the original raw data is lost; further analysis is no longer possible and creating new metrics is almost impossible)
      – Dropping old partitions periodically
  • 3. Intro - Interesting facts
    • The volume of business data worldwide, across all companies, doubles every 1.2 years (it used to be 1.5 years)
    • A regular person processes more data daily than a 16th-century individual did in his entire life
    • In recent years the cost of storage and processing power has dropped significantly
    • Bad data or poor data quality costs US businesses $600 billion annually
    • Big data will drive $232 billion in spending through 2016 (Gartner)
    • By 2015, 4.4 million IT jobs globally will be created to support big data (Gartner)
    • Facebook processes 10 TB of data every day / Twitter 7 TB
    • Google has over 3 million servers processing over 2 trillion searches per year in 2012 (only 22 million in 2000)
  • 4. De-mystifying BigData
    • #1 There’s no such thing as too big! Just data of unusual size :) – meaning that a brute-force approach is out of the question
    • #2 Data is everywhere! You just need to collect, store and understand it!
    • “Big data represents a new era in data exploration and utilization.”
    • Sources of big datasets:
      – Transactions
      – Logs
      – Emails
      – Social media (tweets, messages, posts, likes)
      – Network elements
      – (sampled) Various events and measurements from sensors (e.g. the LHC)
      – User interactions (e.g. shopping sessions / clicks / scrolls & mouse hovers)
      – Geospatial data
      – Audio / Images / Video
      – External sources (there are companies selling data)
  • 5. BigData – domains of interest
    • Computer science: e-commerce, social networks, banking and finance markets, advertising, telecom, media, analytics, visualization, machine learning, predictions, ads, personalized recommendations (better customer understanding), data mining
    • Statistics – analysis with better calculation precision
    • Health, Medicine & Biology (e.g. “Decoding the human genome originally took 10 years to process; now it can be achieved in less than a week”)
    • Processing BigData requires distributed architectures and algorithms (most of them based on a “divide and conquer” approach, while others also rely heavily on sampling)
  • 6. Reasons to use BigData
    • The continuously decreasing cost of hardware and storage
    • There are companies out there storing everything. Some make the mistake of storing all their logs as “big data” :)
    • Work faster, scale, and consider every piece of data, not just a sample
    • Sell better (have an advantage in market analysis and expand into new markets)
    • Few companies actually value BigData and are able to properly explore and use its entire potential.
    • The “divide and conquer” approach – you need to change the way you think!
    • Sharded data goes nicely along with the Bloom filter concept (a minimal sketch follows below)
    • It is very difficult to work with BigData because it is mostly unclear and unstructured, and it implies using huge files
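
    On that last pairing: a Bloom filter can answer “is this key definitely absent from this shard?” without reading the shard’s data, so lookups can skip most shards. A minimal illustrative sketch in Java, with invented sizes and a simple double-hashing scheme:

    import java.util.BitSet;

    // Illustrative Bloom filter: k hash probes into an m-bit array.
    public class BloomFilter {
        private final BitSet bits;
        private final int m;   // number of bits
        private final int k;   // number of hash probes per item

        public BloomFilter(int m, int k) {
            this.bits = new BitSet(m);
            this.m = m;
            this.k = k;
        }

        // Derive the i-th probe index from two base hashes (double hashing).
        private int index(String item, int i) {
            int h1 = item.hashCode();
            int h2 = (h1 >>> 16) | 1;            // force an odd stride
            return Math.floorMod(h1 + i * h2, m);
        }

        public void add(String item) {
            for (int i = 0; i < k; i++) bits.set(index(item, i));
        }

        // false = "definitely absent"; true = "probably present".
        public boolean mightContain(String item) {
            for (int i = 0; i < k; i++)
                if (!bits.get(index(item, i))) return false;
            return true;
        }

        public static void main(String[] args) {
            BloomFilter f = new BloomFilter(1 << 16, 3);
            f.add("user-42");
            System.out.println(f.mightContain("user-42")); // true
            System.out.println(f.mightContain("user-43")); // almost certainly false
        }
    }

    In a sharded setting, each shard keeps its own small filter; a lookup probes all the filters first and only reads the shards whose filters answer “maybe”.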
  • 7. Hadoop - Introduction
    • Hadoop is just a small piece of the BigData puzzle
    • As we all know, Google is far ahead of the entire BigData market out there
    • 2004 – Google published a paper on a process called MapReduce (a framework which provides a parallel processing model and an associated implementation to process huge amounts of data)
    • Google File System (GFS – recently Colossus) -> Apache HDFS
    • Google MapReduce -> Apache MapReduce
    • Google BigTable -> Apache HBase
  • 8. Hadoop vs. Other Systems
    • Computing Model:
      – Distributed Databases: notion of transactions; the transaction is the unit of work; ACID properties, concurrency control; real-time
      – Hadoop: notion of jobs; the job is the unit of work; no concurrency control; not real-time
    • Data Model:
      – Distributed Databases: structured data with a known schema; read/write mode
      – Hadoop: any data will fit, in any format – (un)(semi)structured; read-only mode (append mode added in Hadoop v2)
    • Cost Model:
      – Distributed Databases: expensive servers
      – Hadoop: cheap commodity machines
    • Fault Tolerance:
      – Distributed Databases: failures are rare; recovery mechanisms
      – Hadoop: failures are common over thousands of machines; simple yet efficient fault tolerance – data replication, high availability, self-healing
    • Key Characteristics:
      – Distributed Databases: efficiency, optimizations, fine-tuning
      – Hadoop: scalability, flexibility, fault tolerance
  • 9. Hadoop Subprojects & Related
    • Pig – High-level language for data analysis
    • HBase – Table storage for semi-structured data
    • Zookeeper – Coordinating distributed applications
    • Hive – SQL-like query language and Metastore
    • Mahout – Machine learning
    • Lucene, Solr, Blur – Indexing and search engines
    • …
  • 10. MapReduce - #1
    • With MapReduce, queries are split and distributed across parallel nodes and processed in parallel (the Map step). The results are then gathered and delivered (the Reduce step). The framework was incredibly successful, so others wanted to replicate the algorithm.
    • The same code can run on data of any size.
    • The code runs where the data is.
    • Hence, an implementation of the MapReduce framework was adopted by an Apache open-source project named Hadoop.
    • HDFS = Hadoop Distributed File System, an Apache open-source distributed file system designed to run on commodity hardware
    • But commodity hardware has issues: if one node fails on average once every 3 years, then a 1000-node cluster sees 1000 / (3 × 365) ≈ 0.9 failures per day – roughly one failure every day, on average
  • 11. MapReduce - #2
  • 12. MapReduce - example
    [Diagram-only slide: input blocks on HDFS feed Map tasks producing (k, v) pairs such as (color, 1); a parse-hash step partitions them; Shuffle & Sorting groups records by key k; Reduce tasks consume (k, [v]) lists such as (color, [1,1,1,1,1,1..]) and produce (k’, v’) totals such as (color, 100), written to output files Part0001–Part0003.]
  • 13. Example - NoSQL DB – HBase
    • HBase is a column-oriented database management system that runs on top of HDFS.
    • HBase can store massive amounts of data and allows random access to it
    • It can store sparse datasets / denormalized data, and it has flexible schemas
    • The tables are key-value storage (one single “PK”)
    • Get operations are generally very fast, and scans over a range of start/stop keys are incredibly fast. Hence the need to think hard about choosing the right key “composition”, so that your application can use range scanning (see the sketch below)
    • Drawback of HBase – no SQL support
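
    A brief sketch of those two access patterns using the standard HBase Java client; the table name “events” and the “user#date” composite row key are invented here just to show why key composition matters for range scans:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseKeyDesign {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("events"))) {

                // Point lookup by the single "PK" (the row key): generally very fast.
                Result row = table.get(new Get(Bytes.toBytes("user42#2013-06-01")));
                System.out.println("Row found? " + !row.isEmpty());

                // Range scan: keys are stored sorted, so the prefix "user42#"
                // fetches all of this user's rows in one sequential scan.
                Scan scan = new Scan(Bytes.toBytes("user42#"),
                                     Bytes.toBytes("user42$")); // '$' sorts just after '#'
                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        System.out.println(Bytes.toString(r.getRow()));
                    }
                }
            }
        }
    }

    Putting the user first in the composite key is what makes the prefix scan cheap; keyed the other way around (date first), one user’s events would be scattered across the whole table.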
  • 14. BigData @Amazon
    – Amazon stores and processes more shopping data than any other single company
    – It offers software services to external companies via AWS (Amazon Web Services)
    – Amazon DynamoDB (a high-performance NoSQL database)
    – Amazon Elastic Compute Cloud (EC2), Elastic MapReduce (EMR)
    – Amazon analyzes hundreds of millions of daily sales, inventory ASINs and various SKUs to compute metrics in near real-time and to determine prices that maximize profit and clear inventory (literally the lowest available price online)
    – Amazon uses clickstream analysis, machine learning and data mining to detect robots / fraudulent behavior, optimize page content and structure, and maximize income through recommendations and price-setting mechanisms
    – Amazon sends recommendation offers to customers (read: spam :))
    – It optimizes routes for thousands of package deliveries
  • 15. Questions?
    • And yet, this is only the beginning!
    • Interesting links:
      – http://hadoop.apache.org/docs/r0.18.0/hdfs_design.pdf
      – https://www.youtube.com/watch?v=qqfeUUjAIyQ
      – https://www.youtube.com/watch?v=c4BwefH5Ve8
      – https://www.youtube.com/watch?v=HFplUBeBhcM
    • Expect more on Tech Herbert day!