Big data: analyzing large data sets


Published in: Technology

  1. Big Data: Analyzing Large Data Sets. Rajendra Akerkar, Vestlandsforsking, Norway
  2. ‘Big’ Data
     - When the volume, velocity and variety of data exceed an organisation’s storage or compute capacity for accurate and timely decision making
     - The data is diverse
       - Users create content: blog posts, tweets, social network interactions, and photos
       - Servers constantly log messages about what they’re doing
       - Researchers create measurements of the world around us
     - The vital source of data is extremely large

     R. Akerkar
  3. Data
     - The definition of data is very generic
     - Data is inherently time based
       - Suppose John writes in his LinkedIn profile that he lives in Oslo
       - Suppose that next month John updates his profile location to Bergen
       - The fact that he lives in Bergen now doesn’t change the fact that he used to live in Oslo
       - Both pieces of data are true
     - Data is inherently immutable
       - Since data is connected to a point in time, the truthfulness of a piece of data never changes
  4. Two key operations
     - From create-read-update-delete to create-read
     - No "update" operation: updates don’t make sense with immutable data
       - E.g., "updating" John’s address means adding a new piece of data with a more recent time
     - No "delete" operation: deletion is also represented as creating new data
       - E.g., if Rita stops following John on Twitter, that doesn’t change the fact that she used to follow him
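The create-read model from the two slides above can be sketched as an append-only store of timestamped facts. This is only an illustration: the names `Fact`, `FactStore`, and `latest`, the integer timestamps, and the `follows:John` attribute are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fact:
    """One immutable, timestamped piece of data (hypothetical structure)."""
    subject: str
    attribute: str
    value: str
    timestamp: int  # assumed unit: days since an arbitrary start


class FactStore:
    """Append-only store: supports create and read, never update or delete."""

    def __init__(self) -> None:
        self._facts: list[Fact] = []

    def create(self, fact: Fact) -> None:
        self._facts.append(fact)

    def read_all(self) -> tuple[Fact, ...]:
        return tuple(self._facts)

    def latest(self, subject: str, attribute: str) -> Optional[str]:
        """Derive the current value as the most recent matching fact."""
        matches = [f for f in self._facts
                   if f.subject == subject and f.attribute == attribute]
        if not matches:
            return None
        return max(matches, key=lambda f: f.timestamp).value


store = FactStore()
store.create(Fact("John", "location", "Oslo", timestamp=1))
# "Updating" the location is just a newer fact; the Oslo fact stays true.
store.create(Fact("John", "location", "Bergen", timestamp=31))
# "Deleting" a follow is also a new fact, not a removal.
store.create(Fact("Rita", "follows:John", "true", timestamp=5))
store.create(Fact("Rita", "follows:John", "false", timestamp=40))

current_location = store.latest("John", "location")
rita_follows_john = store.latest("Rita", "follows:John")
location_history = [f.value for f in store.read_all()
                    if f.subject == "John" and f.attribute == "location"]
```

The current state is derived by reading, so the full history stays available for later queries.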
  5. A new kind of technology
     - Limitations of traditional database systems (say, relational databases)
       - Failed to scale to Big Data
     - NoSQL
       - Scales to enormously larger sets of data
       - However, using these technologies effectively requires a fundamentally new set of techniques
  6. Examples
     - Innovative technologies to build robust and scalable Big Data systems
       - Distributed file systems
       - MapReduce computation framework
       - Distributed locking services
       - Amazon’s distributed key-value store: Dynamo
     - Hadoop
     - HBase
     - RabbitMQ
     - MongoDB
     - Cassandra
  7. Storage and retrieval capability
     - Hadoop Distributed File System (HDFS):
       - An Apache open source distributed file system
       - Expected to run on high-performance commodity hardware
       - Known for highly scalable storage and automatic data replication across three nodes for fault tolerance
       - Automatic data replication across three nodes eliminates the need for backup
       - Write once, read many times
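To illustrate why three-way replication tolerates node failures, here is a toy placement sketch. It is not how HDFS chooses replica locations (the real policy is more involved); the round-robin scheme, the node names, and the helpers `place_blocks` and `readable_blocks` are assumptions made for this example.

```python
REPLICATION_FACTOR = 3  # three copies per block, as stated on the slide


def place_blocks(block_ids, nodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct nodes, round-robin (toy policy)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + k) % len(nodes)] for k in range(replication)]
    return placement


def readable_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {block for block, replicas in placement.items()
            if any(node not in failed_nodes for node in replicas)}


nodes = ["n1", "n2", "n3", "n4", "n5"]
blocks = ["b0", "b1", "b2", "b3"]
placement = place_blocks(blocks, nodes)

# With three replicas, any two node failures leave every block readable.
after_two_failures = readable_blocks(placement, failed_nodes={"n1", "n2"})
# Losing all three nodes that hold a block makes that block unreadable.
after_three_failures = readable_blocks(placement, failed_nodes={"n1", "n2", "n3"})
```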
  8. Database capability
     - Oracle NoSQL
       - Dynamic and flexible schema design
       - High performance key-value pair database. A key-value pair is an alternative to a pre-defined schema. Used for non-predictive and dynamic data.
       - Able to efficiently process data without a row and column structure. The major + minor key paradigm allows multiple record reads in a single API call
       - Highly scalable multi-node, multiple data center, fault tolerant, ACID operations
       - Simple programming model, random index reads and writes
       - Not Only SQL. Simple pattern queries and custom-developed solutions to access data, such as Java APIs
       - ml
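The major + minor key idea can be sketched with a plain in-memory dictionary. This is not the Oracle NoSQL API: the class `MajorMinorStore` and its `put`, `get`, and `multi_get` methods are hypothetical names chosen for illustration. The sketch assumes records sharing a major key are grouped together, so one call can return all of them.

```python
from collections import defaultdict


class MajorMinorStore:
    """Toy key-value store keyed by (major key, minor key) tuples."""

    def __init__(self):
        # major key -> {minor key -> value}
        self._data = defaultdict(dict)

    def put(self, major, minor, value):
        self._data[major][minor] = value

    def get(self, major, minor):
        return self._data[major][minor]

    def multi_get(self, major):
        """Read every record under one major key in a single call."""
        return dict(self._data.get(major, {}))


kv = MajorMinorStore()
kv.put(("users", "john"), ("profile", "location"), "Bergen")
kv.put(("users", "john"), ("profile", "employer"), "Acme")
kv.put(("users", "rita"), ("profile", "location"), "Oslo")

john_records = kv.multi_get(("users", "john"))
```

There is no fixed row and column layout here: each major key can carry a different set of minor keys.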
  9. Processing and integration capability
     - MapReduce:
       - Defined by Google in 2004
       - Breaks a problem up into smaller sub-problems
       - Distributes data workloads across thousands of nodes
       - Can be exposed via SQL and in SQL-based BI tools
       - ntrusted_dlcp/ educe-osdi04.pdf
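A single-process sketch of the MapReduce pattern, using word counting as the example problem. A real framework would run the map and reduce steps on many nodes; here everything runs locally, and the function names `map_phase`, `shuffle`, `reduce_phase`, and `map_reduce` are illustrative, not part of any library.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    # Each document is an independent sub-problem, so the map calls
    # could run on different nodes in a distributed setting.
    mapped = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = map_reduce(["big data big systems", "data systems scale"])
```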
  10. Processing and integration capability
      - Apache Hadoop:
        - Leading MapReduce implementation
        - Highly scalable parallel batch processing
        - Highly customizable infrastructure
        - Writes multiple copies across the cluster for fault tolerance
  11. Data integration capability
      - Oracle Big Data Connectors, Oracle Loader for Hadoop, Oracle Data Integrator:
        - Exports MapReduce results to RDBMS, Hadoop, and other targets
        - Connects Hadoop to relational databases for SQL processing
        - Includes a graphical user interface integration designer that generates Hive scripts to move and transform MapReduce results
        - Optimized processing with parallel data import/export
  12. Statistical analysis capability
      - Open Source Project R & Oracle R Enterprise
        - Programming language for statistical analysis
        - Introduced into Oracle Database as a SQL extension to perform high-performance in-database statistical analysis
        - Oracle R Enterprise allows reuse of pre-existing R scripts with no modification
  13. Scale a traditional database to Big Data
      - Partition-tolerance
        - A partition-tolerant system sustains its properties even when one or more messages fail to be delivered between any two nodes in the system
      - Consistency
        - Once you do a write, future reads will take that write into account
      - Availability
        - One can always read from and write to the system
  14. Brewer’s CAP theorem
      - A method for exploring the relationship between the above three key properties of data systems
      - It states that, though it is desirable to have Consistency, High-Availability and Partition-tolerance in every system, no system can achieve all three at the same time
      - Recognizing which of the “CAP” properties your business really needs should be the first step in building any successful distributed, scalable, highly-available system
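The trade-off can be made concrete with a toy two-replica register. This is an illustration of the choice a system faces during a partition, not a proof of the theorem or a model of any real database. The `TwoNodeRegister` class and its "CP" and "AP" modes are hypothetical constructs for this sketch.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None


class TwoNodeRegister:
    """Two replicas of one value; `mode` picks what to give up during a partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = Replica("a")
        self.b = Replica("b")
        self.partitioned = False  # True means messages between a and b are lost

    def write(self, replica, value):
        """Return True if the write was accepted."""
        other = self.b if replica is self.a else self.a
        if self.partitioned:
            if self.mode == "CP":
                return False       # stay consistent by refusing the write
            replica.value = value  # stay available; replicas now diverge
            return True
        replica.value = value
        other.value = value        # replication succeeds when connected
        return True

    def read(self, replica):
        return replica.value


cp = TwoNodeRegister("CP")
cp.write(cp.a, "v1")
cp.partitioned = True
cp_accepted = cp.write(cp.a, "v2")  # rejected: availability is sacrificed

ap = TwoNodeRegister("AP")
ap.write(ap.a, "v1")
ap.partitioned = True
ap_accepted = ap.write(ap.a, "v2")  # accepted: consistency is sacrificed
```

In "CP" mode both replicas still agree on `"v1"`. In "AP" mode the write succeeds but a read from the other replica returns the stale value.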
  15. Properties of Big Data systems
      - Low latency reads and updates
        - To be able to accomplish low latency reads and updates without compromising the robustness of the system
      - Robust and fault-tolerant
      - Scalable
        - The ability to maintain performance in the face of increasing data and/or load by adding resources to the system
      - Extensible
        - Functionality can be added with a minimal development cost
      - Allows ad hoc queries
      - Less maintenance
        - Select components that have as small an implementation complexity as possible
  16. Rationale behind a data system
      - Query = Function(All data)
      - Two notions: "data" and "queries"
      - A data system answers questions ("queries") about a dataset
      - A query is a derivation from a set of data
      - Queries are functions of the complete dataset
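The equation Query = Function(All data) maps directly onto code: every query is an ordinary function applied to the complete dataset. The page-view records and the function names below are made up for illustration.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")
Record = dict[str, object]

# Hypothetical complete dataset: one record per page view.
ALL_DATA: list[Record] = [
    {"user": "john", "url": "/home", "day": 1},
    {"user": "rita", "url": "/home", "day": 1},
    {"user": "john", "url": "/jobs", "day": 2},
    {"user": "john", "url": "/home", "day": 3},
]


def run_query(function: Callable[[Sequence[Record]], T],
              all_data: Sequence[Record]) -> T:
    """Query = Function(All data)."""
    return function(all_data)


def pageviews_per_url(data: Sequence[Record]) -> dict:
    return dict(Counter(record["url"] for record in data))


def unique_users(data: Sequence[Record]) -> int:
    return len({record["user"] for record in data})


views = run_query(pageviews_per_url, ALL_DATA)
users = run_query(unique_users, ALL_DATA)
```

Recomputing over all data on every query is too slow at scale, which is the motivation for precomputing views.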
  17. Big Data architecture
      - Big Data systems as a three-layer cake
      - The batch layer stores raw data in flat files
        - Data is immutable and append-only
      - The batch layer runs in a while(true) loop, continuously calculating the views for the serving layer
      - The speed layer is its own independent system that only operates on the last few hours of data
        - It uses more traditional random read / random write databases to store its indexes
      - Applications resolve queries by querying both the batch indexes and the speed layer indexes and merging them together
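A minimal sketch of how a query might merge a batch view with a speed-layer view, using page-view counts as the example. The event records, the cutoff hour, and the function names are assumptions for this illustration. In a real system the two views would live in separate stores and be built by separate processes.

```python
from collections import Counter

# Hypothetical append-only event log; "hour" is the arrival time of each event.
EVENTS = [
    {"url": "/home", "hour": 1},
    {"url": "/home", "hour": 2},
    {"url": "/jobs", "hour": 3},
    {"url": "/home", "hour": 9},
    {"url": "/jobs", "hour": 10},
]
BATCH_CUTOFF_HOUR = 8  # the last batch run covered events up to this hour


def build_batch_view(events, cutoff):
    """Batch layer: recompute the view from all data the batch run has seen."""
    return Counter(e["url"] for e in events if e["hour"] <= cutoff)


def build_speed_view(events, cutoff):
    """Speed layer: cover only recent events the batch view does not include yet."""
    return Counter(e["url"] for e in events if e["hour"] > cutoff)


def resolve_query(url, batch_view, speed_view):
    """Application: query both views and merge the results."""
    return batch_view[url] + speed_view[url]


batch_view = build_batch_view(EVENTS, BATCH_CUTOFF_HOUR)
speed_view = build_speed_view(EVENTS, BATCH_CUTOFF_HOUR)
home_views = resolve_query("/home", batch_view, speed_view)
jobs_views = resolve_query("/jobs", batch_view, speed_view)
```

Counts merge by simple addition; other kinds of views would need their own merge logic.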
  18. Big data sources
      - The media/entertainment industry is collecting large amounts of rich content and user viewing behaviors
      - The healthcare industry is quickly moving to electronic medical records and images, which it wants to use for short-term public health monitoring and long-term epidemiological research programs
      - Low-cost gene sequencing (<$1,000) can generate tens of terabytes of information that must be analyzed to look for genetic variations and potential treatment effectiveness
      - Video surveillance is still transitioning from CCTV to IPTV cameras and recording systems that organizations want to analyze for behavioral patterns (security and service enhancement)
      - Sensor data is being generated at an accelerating rate from fleet GPS transceivers, RFID tag readers, smart meters, and cell phones (call data records [CDRs]); that data is used to optimize operations and drive operational business intelligence (BI) to realize immediate business opportunities
  19. Big data sources
      - Financial transactions
        - With the strengthening of global trading environments and the better practice of programmed trading, the volume of transactions that need to be collected and analyzed can double in size, while transaction volumes can also fluctuate much faster, much wider, and much more unpredictably; competition among enterprises forces trading decisions to be made at ever smaller intervals
      - Smart instrumentation
        - The use of intelligent meters in "smart grid" energy systems that shift from a monthly meter read to an "every 20 minute" meter read can translate into a multi-thousand-fold increase in data generated
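The "multi-thousand-fold" figure for smart meters can be checked with a short calculation. The 30-day month and the helper name `reads_per_month` are assumptions for this sketch.

```python
MINUTES_PER_DAY = 24 * 60


def reads_per_month(interval_minutes, days_in_month=30):
    """Number of meter reads in one month at a fixed read interval."""
    return (MINUTES_PER_DAY // interval_minutes) * days_in_month


monthly_reads_before = 1                  # one read per month
monthly_reads_after = reads_per_month(20)  # one read every 20 minutes
increase_factor = monthly_reads_after // monthly_reads_before
```

Under these assumptions, 72 reads per day over 30 days gives 2,160 reads per meter per month, a roughly two-thousand-fold increase.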
  20. Potential business value of big data
      - High Resolution Management
        - Examine the detailed workings of processes so that the way those processes are designed and managed can be changed to take advantage of more information
      - Big data can help identify events that can be early warnings of problems or indicators of opportunities to optimize a process
      - The improved model that big data can help create can be the foundation for better predictions
      - When big data is examined in real time, it can form the basis for Operational Intelligence systems
  21. Benefits of Big Data
      - Applications will be more robust
      - Performance will be more predictable
      - Collect more data and get more value out of your data
      - Huge opportunities to mine your data, produce data analytics, and build new applications
  22. Thank you