A framework to cluster structured data, with a recovery mechanism

Analytic business applications are the trend-defining applications that establish the worth of data at a given time. For any business organization it is important to record and log everyday business activities, strategies, and the reasoning behind decisions; these records may be the ingredients of wiser business decisions in the future. In traditional enterprise IT, the term ‘database’ commonly refers to relational databases. Adding structure to data gives it a power that no unstructured-data management tool can fully exploit. Hadoop tried to treat structured and unstructured data the same way (i.e., it applied unstructured-data management tools to structured data), but the cost of query execution rose sharply. MapReduce is a technique best suited to storing and querying unstructured data; since Hadoop claims to handle both structured and unstructured data, this research book illustrates a method that can help Hadoop and other big-data management tools work on structured data as effectively as they work on unstructured data.

The performance of traditional data management tools drops when running cross-table analytical queries on structured data in a distributed processing environment; response times are high because of ill-aligned data sets and the complex hierarchy of the distributed computing environment. Data alignment requires a complete shift in the data deployment paradigm, from a row-oriented storage layout to a column-oriented one, while the complex hierarchy of the distributed computing environment can be handled by keeping metadata of the entire data set. This research proposes an approach that eases the deployment of structured data into the distributed processing environment by arranging data into column-wise combinational entities. The response time of analytical queries can be lowered with the support of two concepts: shared architecture and multi-path query execution.
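The shift from a row-oriented to a column-oriented layout can be sketched as a simple pivot: rows become per-column value lists, and columns that are typically scanned together are kept in the same group. This is only an illustrative sketch of the idea of column-wise combinational entities; the function name, the grouping policy, and the sample records are assumptions, not the book's actual method.

```python
# Illustrative sketch (not the proposed framework itself): pivot
# row-oriented records into column-wise groups so analytical scans
# touch only the columns they need.

def to_column_groups(rows, groups):
    """Pivot row-oriented records into column-oriented groups.

    rows   -- list of dicts, one per record (row-oriented layout)
    groups -- list of column-name tuples; columns that are queried
              together stay in the same group (a "combinational entity")
    """
    out = {}
    for group in groups:
        # Each group maps column name -> the full column of values.
        out[group] = {col: [row[col] for row in rows] for col in group}
    return out

rows = [
    {"id": 1, "name": "a", "amount": 10.0},
    {"id": 2, "name": "b", "amount": 20.0},
]
# Hypothetical grouping: keep the analytic column "amount" next to
# the key "id"; store "name" separately.
entities = to_column_groups(rows, [("id", "amount"), ("name",)])
```

An analytical query such as a sum over `amount` can then read one group's contiguous value list instead of scanning every full row.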
Highly scalable systems are based on a shared-nothing architecture, but degraded performance and fault tolerance are side effects that come with that scalability. The proposed method is an effort to balance the equation between scalability, performance, and fault tolerance. Owing to the limited scope of this research, we concentrate on issues and solutions for structured data only. Shared architecture and an active backup help improve the system's performance by sharing the workload per node. The clustering methodology spreads out the data pressure points to minimize the data lost per node crash.
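One way to read "active backup" and "spreading the data pressure points" is a placement policy in which each column-wise entity has a primary copy on one node and a live backup on a different node, so a single node crash loses no data. The sketch below is a hedged illustration under that assumption; round-robin placement, the node names, and the `primary`/`backup` roles are all hypothetical, not the book's specified mechanism.

```python
# Illustrative sketch: round-robin placement of column-wise entities
# across nodes, with an active backup copy on the next node in the
# ring. Assumes at least two nodes so primary and backup differ.

def place_with_backup(entities, nodes):
    """Return {node: [(entity, role), ...]} with primary and backup copies."""
    placement = {n: [] for n in nodes}
    for i, entity in enumerate(entities):
        primary = nodes[i % len(nodes)]
        backup = nodes[(i + 1) % len(nodes)]  # next node, never the same one
        placement[primary].append((entity, "primary"))
        placement[backup].append((entity, "backup"))
    return placement

plan = place_with_backup(["E1", "E2", "E3"], ["node-a", "node-b"])
```

Because every entity lives on two distinct nodes, losing any one node leaves a complete copy of the data set reachable, and the surviving node's backup copies can serve queries while the crashed node recovers.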