New ways to access databases through the Web are emerging every day. In this research I propose a comprehensive solution by which people can access all these databases in a unified manner and retrieve the information they want automatically. Our main aim is to improve the way we presently acquire information from the Web, so that retrieving information becomes much easier and faster.
The growth of the World Wide Web (WWW) has been phenomenal. Simplicity has been the main reason for its huge success. Another reason is compatibility: users can browse or download multimedia documents across different platforms, with only a little coordination required between users and information providers.
The Web has developed rapidly, and many new accessible databases are available on it. Researchers call such a database a Web database (WDB). The Web is divided into two parts: the surface Web and the deep Web. The surface Web (also called the Clearnet, the indexable Web, or the visible Web) is the part of the WWW that conventional search engines can index; in other words, it consists of static Web pages that are visible to search engines. The deep Web is the part of the WWW that conventional search engines cannot index; in other words, it consists of dynamic Web pages that are not visible to standard search engines.
Query interfaces hide the vast information stored in Web databases; a query interface acts as a barrier between the information and the user who wants to retrieve it. So if a user wants to access the information stored in a Web database, he or she must do so through the query interface. For example, Amazon, a very large and popular e-commerce Web site, provides the query interface shown in Figure 1.
In 2004, UIUC conducted a survey of the total number of Web databases and query interfaces: more than 300,000 Web databases and 450,000 query interfaces were available, and these numbers have grown phenomenally through the years. Not only is the number of Web databases increasing, but the amount of information on every topic stored in them is growing at an exponential rate. Some deep Web portal services provide deep Web directories that classify Web databases; for example, more than 7,000 Web databases have been collected and classified under 42 topics, according to a survey by CompletePlanet. As we can see, Web databases act like a great storehouse of information on every topic, providing everyone a chance to find what they want.
Let us take an example explaining why it is necessary to integrate Web databases. Suppose James wants to buy a mobile phone, and wants the least expensive phone available on the Web. First, he must search for Web sites that sell mobile phones, and he must find as many as he can in order to save money. Second, on each site he must fill in the relevant details about the phone (for example, entering “Motorola X” in the phone-name field) and click the Search button. Third, after the request has been submitted, the server responds with Web pages containing the results of the query; he then goes through all the response pages and selects the best phone offered on them. As you will have noticed, filling in the same details on site after site just to buy a phone takes a lot of time. This manual approach of finding relevant sites and then querying each of them is too tedious to follow every time we want to find something on the Web.
It is necessary and very important that we find a way to integrate all the available Web databases and unify them, so that people can easily and automatically retrieve the information they want from the Web. In my research into what makes Web databases distinct from other heterogeneous data sources, I identified four important characteristics:
- Heterogeneity: Web pages, query interfaces, and response pages are designed by different people for different Web sites; there is no specific design that designers must follow to create a Web site. Comparing various Web sites, we can clearly see that even within the same topic, the designs of the Web pages often differ.
- Scale: There are millions of Web databases on the Web right now, and the number is increasing by the minute. Every single topic has so much information stored across the Web databases that a person would be surprised to learn that so much information even exists.
- Access through query interfaces: A Web database is designed in such a manner that its contents cannot be captured or stored directly by any user. If we want to retrieve information from it, we have to go through the query interface the Web database provides and view our results via the response pages it generates.
- Dynamic: First, the Web contains a wide range of Web databases, spread sparsely all over the Web, and Web databases have a tendency to vanish and reappear; searching a specific Web database for a specific topic is as difficult as finding a particular drop of water in the ocean. Second, the information stored in Web databases is refreshed or updated regularly to keep this virtual world in sync with the real world: train timetables and cinema showtimes, for example, change on a daily basis. Web databases are therefore updated continually so that users always get the right and correct information, while old or outdated information is deleted.
So, to summarize, the aim of my research paper is to provide an efficient and effective way for users to retrieve the information they want from Web databases.
My research paper focuses on the challenges faced by a Web integration system and on how I address and overcome those challenges with appropriate solutions. In the paper, I present a comprehensive answer to the challenges that arise in building a Web database integration system.
Nowadays some Web sites provide Web services through which their Web databases can be accessed. These services help users obtain the required information easily and efficiently. But two problems arise with Web services:
- Only a small portion of Web sites provide Web services for accessing their Web databases.
- The Web services provided by Web sites depend on custom programs, which are not easy for ordinary people to operate.
So in this research paper, I am going to discuss the popular approach of using query interfaces to access the Web database behind a particular Web site.
This research is divided into three main sections:
- A comprehensive solution for Web database Integration
- Research status in this area
- Research works
3. A Comprehensive Solution for WDB Integration
In this section we discuss the approach we take toward a comprehensive solution for the integration of the available Web databases. The proposed approach followed in this research paper is shown in Figure 2.
This solution is divided into three primary modules:
- Integrated interface generation module
- Query processing module
- Result processing module
a. Integrated interface generation module:
This module addresses how we create an integrated interface that brings Web databases together.
This module consists of four parts, whose features are written below:
- Discovery of Web databases: First, we search the Web for sites that have a Web database working behind the surface, and then identify the query interface used on each of these sites.
- Extraction of query interface schema: In this step, we gather all the information we can about the query interface, such as the attributes “Book name”, “Subject”, and “Publisher” (in Figure 1), together with meta-information about each attribute present on the Web database, such as its default value and value type.
- Clustering Web databases: The discovered Web databases are clustered into groups, each group containing Web databases of a similar nature, so that Web databases can easily be differentiated.
- Interface integration: Finally, taking one group at a time, the matching attributes of the different query interfaces are merged, producing an integrated interface whose common attributes serve as global attributes.
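As a rough illustration of the interface-integration step, the Python sketch below merges the attribute lists of a few hypothetical book-search interfaces by normalized label. Real attribute matching also uses data instances and meta-information, so this is a deliberate simplification:

```python
from collections import Counter

def normalize(label):
    """Crudely normalize an attribute label for matching: lowercase and
    keep only letters, digits, and spaces."""
    return "".join(ch for ch in label.lower() if ch.isalnum() or ch == " ").strip()

def integrate_interfaces(interfaces):
    """Merge the attribute lists of several query interfaces into one
    global schema: attributes shared by more interfaces come first, and
    a mapping records which local labels map to each global attribute."""
    counts = Counter()
    label_map = {}
    for attrs in interfaces:
        for label in attrs:
            key = normalize(label)
            counts[key] += 1
            label_map.setdefault(key, set()).add(label)
    global_schema = [key for key, _ in counts.most_common()]
    return global_schema, label_map

# Three hypothetical book-search interfaces
schema, label_map = integrate_interfaces([
    ["Title", "Author", "Publisher"],
    ["title", "Author", "ISBN"],
    ["Title", "Price"],
])
```

Attributes that appear on more interfaces ("title", "author") end up first in the global schema, which is one plausible way to rank global attributes on the integrated interface.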
b. Query processing module:
After the integrated interface has been generated, the user can search for information, and the query must then be processed against every Web database related to that topic. This module consists of three parts, whose features are written below:
- Web database selection: The user’s query is processed by carefully choosing appropriate Web databases, so as to get proper results quickly and at minimal cost.
- Translating queries: The user’s query is expressed over the global attributes of the integrated interface; it must be translated into a query over the local attributes of each relevant Web database.
- Submitting queries: Each translated local query is submitted to its Web database.
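The translation step can be sketched as follows, assuming a query is represented as a simple attribute-to-value mapping (the attribute names are illustrative):

```python
def translate_query(global_query, local_attrs):
    """Keep only the predicates whose global attribute exists on the
    local interface, and report which predicates were dropped; dropped
    predicates must be re-applied by post-filtering the returned
    results, since the local interface cannot enforce them."""
    local_query = {a: v for a, v in global_query.items() if a in local_attrs}
    dropped = sorted(a for a in global_query if a not in local_attrs)
    return local_query, dropped

# A global query sent to two hypothetical Web databases whose local
# schemas differ in query capability.
q = {"title": "java", "price": "<20"}
full_query, _ = translate_query(q, {"title", "price", "author"})
partial_query, dropped = translate_query(q, {"title", "author"})
```

The second database cannot constrain "price", so its local query keeps only the "title" predicate and the caller learns it must filter prices itself.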
c. Result processing module:
After the user’s query has been processed, the next step is to fetch the required information from the Web databases, merge the information fetched from the different Web databases under the global attributes, and send it back to the integrated interface so that it is visible to the user. This module also consists of three parts, whose features are given below:
- Extracting results: The response pages returned by the Web databases contain the results we are expecting; those results are analyzed, identified, and extracted.
- Annotation of results: After extracting the results, we need to attach the correct semantic labels to them.
- Merging the results: In the final step, the results extracted from the different databases are merged under the global attributes of the integrated interface.
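A minimal sketch of the merging step, assuming the records have already been extracted and annotated under the global attributes, and that a hypothetical key attribute such as an ISBN is available for duplicate detection:

```python
def merge_results(result_sets, key_attr):
    """Merge the record lists returned by several Web databases, using
    key_attr (e.g. "isbn") to drop duplicate entities; the first
    occurrence of each key wins. Records lacking the key are kept."""
    seen = set()
    merged = []
    for records in result_sets:
        for rec in records:
            key = rec.get(key_attr)
            if key is not None and key in seen:
                continue  # duplicate entity already merged
            if key is not None:
                seen.add(key)
            merged.append(rec)
    return merged
```

Real entity identification is harder than exact key matching (see Section 5), but when a shared identifier exists this simple pass already removes cross-database duplicates.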
Having discussed the whole process of extracting information through an integrated interface, we can see in Figure 2 that the components depend on each other. For example, the integrated interface generation module directly affects the query processing module. This tells us that the quality of every component matters, both directly and indirectly.
4. Research Status in This Area
Huge efforts have been devoted to this area to date. Because of space limits, these efforts are not discussed in detail; instead they are summarized according to the issues they address, with representative works noted.
Despite all the effort that has been put into this area, research development has been uneven. Some issues have been addressed very well, and existing solutions can be adopted; these are called developed issues. Some issues are still in their developing stages and need more research, and some issues have not been researched at all. We present a summary of the research status according to development status.
a. Developed Issues
Interface integration: This problem has received ample attention and has been solved using various approaches, such as matching the attributes of query interfaces using both labels and data instances.
Extraction of query interface schema, for understanding the query capabilities a query interface supports: a 2P grammar has been built, together with a best-effort parser, to derive the schema.
b. Developing Issues
The developing issues, and their shortcomings, are addressed below.
Web database discovery: One technique crawls focused on a given topic and then picks the links most likely to lead to query interfaces, but there is no verification of the quality of the discovered Web databases. Another uses automatic feature generation to characterize candidates and C4.5 decision trees to detect query interfaces; however, it cannot distinguish the query interfaces of search engines from those of Web databases.
Web database clustering: One approach performs the clustering based on the features offered on the interface page. Another proposes an objective function, model-differentiation, for computing the likelihood. Accuracy depends on the design of the interfaces, so these approaches handle query interfaces with simple schemas poorly.
Result extraction: Numerous approaches have been proposed to address this issue. Several of them first convert the response page into an HTML tag tree, then recognize and extract data records by analyzing the tree structure and tag information. Their main disadvantage is that they can only deal with Web pages written in HTML.
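As a small illustration of the tag-tree idea (not any particular published system), the sketch below parses a well-formed XHTML-like response page with Python's standard XML parser and returns the largest group of sibling subtrees sharing the same structural fingerprint; real systems use a tolerant HTML parser and richer tree analysis:

```python
import xml.etree.ElementTree as ET
from collections import Counter

def tag_signature(node):
    """The tag of a node plus the tags of its children: a crude
    structural fingerprint of a subtree."""
    return (node.tag, tuple(child.tag for child in node))

def find_data_records(xhtml):
    """Parse a well-formed response page and return the largest group
    of sibling subtrees that share the same fingerprint -- a common
    heuristic for locating repeated data records."""
    root = ET.fromstring(xhtml)
    best = []
    for parent in root.iter():
        sigs = Counter(tag_signature(child) for child in parent)
        if not sigs:
            continue
        sig, count = sigs.most_common(1)[0]
        if count > len(best):
            best = [c for c in parent if tag_signature(c) == sig]
    return best
```

On a page whose result list is a run of structurally identical `<li>` items, this returns exactly those items; it also shows the HTML-dependence limitation noted above, since a page rendered without such repeated markup defeats it.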
Result annotation: This issue is usually addressed during result extraction. One approach finds the correct annotation of an extracted data item in the response page using heuristic rules. The main advantage of such rules is that they are very effective when a data item has its annotation present in the response page; however, they cannot reliably associate all data items with their annotations.
Entity identification: A very important component of data merging is entity identification; for example, one approach applies a set of domain-independent string transformations to match the entities’ shared attributes in order to identify matching entities. Many approaches have been adopted to solve this problem, but all of them assume that an accurate schema match between the Web databases is already available, while schema matching in the Web context has not yet been solved.
c. Undeveloped Issues
Web database selection, query translation, and data merging are undeveloped issues. Although these issues have been studied in other contexts (for example, data warehouses), no approach has been proposed to investigate them in the context of Web database integration. Among the developing and undeveloped issues, entity identification, result extraction, and Web database selection are on my PhD pathway at present and in the future; they are discussed in Section 5.
5. Some Research Works
In this section, some research works that are presently being done, or will be done in the future, are discussed.
a. Entity Identification among Web Databases
Entity identification is the main operation in integrating data from different sources, so the topic has been well studied and explored for years. Several solutions have been proposed for Web databases, as discussed earlier in Section 4, but they are all based on the assumption that the schema match between Web databases has been made accurately and without error. Making a schema match in the Web context is a huge amount of work because of the poor structure of Web pages, and in the absence of automation it has to be done manually.
So the main aim is to implement entity identification without having to make a schema match between the Web databases. The basic approach is as follows. We do not try to recover the schema of the data records in the response pages; as an alternative, we take each data record from ‘A’ (the first database) or ‘B’ (the second database) and treat it as a text document. We judge whether data record ‘a’ (from database A) and data record ‘b’ (from database B) match by comparing the similarity of their text. Obviously, it is not advisable to compute the text similarity between whole data records directly: not only is it unsophisticated, it also had a low accuracy rate in our tests. The reason is that the different parts of a data record have different importance, and data records contain a lot of unwanted data; for example, the words “author” and “price” both appear in book data records quite often. In order to obtain a more realistic similarity between ‘a’ and ‘b’ (so that, for example, ‘a’ and ‘b’ have greater similarity when they refer to the same entity than ‘a’ and ‘c’, which do not), the method proceeds as follows:
- Filter as much noise information as possible out of ‘a’ and ‘b’;
- Split ‘a’ into several blocks, each of which serves as a query against ‘b’;
- Calculate the similarity between each block and ‘b’;
- Assign a suitable weight to each block’s similarity with ‘b’, and sum the weighted similarities;
- Decide whether ‘a’ and ‘b’ refer to the same entity according to the overall similarity.
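The steps above can be sketched in Python; the token-based Jaccard similarity and the hand-fixed weights and threshold below are placeholders for the trained ones described next:

```python
def tokens(text):
    """Split a text fragment into a set of lowercase tokens."""
    return set(text.lower().split())

def block_similarity(block, record_b):
    """Jaccard overlap between one block of record a and the whole of
    record b, both treated as plain text."""
    t1, t2 = tokens(block), tokens(record_b)
    return len(t1 & t2) / len(t1 | t2) if t1 | t2 else 0.0

def same_entity(blocks_a, record_b, weights, threshold):
    """Weighted sum of per-block similarities compared to a threshold.
    In the approach above the weights and threshold are trained from a
    small set of sample data record pairs; here they are fixed by hand."""
    score = sum(w * block_similarity(blk, record_b)
                for blk, w in zip(blocks_a, weights))
    return score >= threshold
```

Weighting lets important blocks (say, a title) count more than noisy ones (say, a seller blurb), which is exactly why whole-record similarity performs worse.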
At present, studies are being conducted to find an effective algorithm for training the weights and the threshold of the overall similarity from a small set of sample data record pairs. A data record pair consists of two data records from different Web databases that refer to the same entity. The algorithm will be discussed in detail later. Preliminary experimental results on the book topic are encouraging; research on other topics (such as cars) will also be conducted.
b. Vision Based Result Extraction
As noted above, extraction approaches based on HTML tags work only for pages written in HTML; it is therefore necessary to pursue vision-based, markup-language-independent approaches. In the current phase, only response pages with multiple data records are targeted. Our basic idea is that although the data records in a response page differ in content, they are similar in appearance. The implementation is as follows:
- Obtain the vision information (such as text font, image size, and image location in the Web page) through the programming interface of the Web browser;
- Build a vision-based block tree using the VIPS algorithm; result extraction then amounts to determining which blocks compose a data record;
- Locate the data region (the region containing all the data records in a response page) in the vision-based block tree;
- Compute the vision similarity of blocks in the vision-based block tree to determine the boundaries of all data records.
Preliminary experiments indicate that this approach is not only independent of the markup language but also very suitable for picking out information-rich data records.
c. Selection of Web Database
Innumerable Web databases can be found on the Web, so many Web databases typically need to be integrated under one topic. If a query submitted by a user on the integrated interface were sent to every integrated Web database, a lot of time would be wasted and much overhead produced in processing the query (for example, in data cleaning and duplicate removal). Most of the time, we need to select only a few of them to achieve satisfying results. Hence, the main aim of Web database selection is to choose suitable Web databases according to the user’s query on the integrated interface, which helps users get the desired results at the lowest cost.
There are basically two aspects to consider when deciding whether a Web database should be selected to answer a given query: first, the relevance of the Web database to the given query; and second, the query capability of the Web database’s query interface. These two aspects are discussed in the following paragraphs.
The precondition for selecting a database is that it is relevant to the query. Obviously, it is pointless to query a Web database that holds no useful information for the query. Figure 3 gives an example to demonstrate this.
Assume that ‘A’, ‘B’, ‘C’, and ‘D’ are four Web databases and ‘q’ is a query over them, where the size of ‘A’, ‘B’, ‘C’, and ‘D’ is the number of data records they contain and the size of ‘q’ is the number of data records that satisfy ‘q’. Intuitively, ‘C’ does not satisfy ‘q’ at all, ‘B’ satisfies ‘q’ partially, and ‘A’ and ‘D’ can both satisfy ‘q’ completely, but ‘D’ is the better selection compared with ‘A’. So we need to acquire the features of Web databases in advance: the size, the update ratio, the distribution over each attribute, and so on. A Web database is hard to observe directly, because it cannot be accessed in any way other than through its query interface; the task is to find a way to acquire these features through the query interface alone.
We would like to address this problem in the future by designing a sample record retriever, which obtains a small set of data records spread evenly over the Web database; examining these records lets us profile the Web database. The sample record retriever should comprise two major components: a query interface analyzer and a query generator. The query interface analyzer obtains the necessary information about each attribute, while the query generator produces a set of smart queries according to the information the analyzer obtained.
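A minimal sketch of the query generator, assuming the (hypothetical) interface analyzer has already extracted a list of candidate values for each attribute, for instance from a select element's options:

```python
def generate_probe_queries(attr_info, per_attr=2):
    """From the value lists the interface analyzer extracted for each
    attribute, emit a small set of single-attribute probe queries.
    Submitting these probes and sampling the returned records is one
    way to estimate the database's features (size, distributions)."""
    probes = []
    for attr, values in attr_info.items():
        for value in values[:per_attr]:
            probes.append({attr: value})
    return probes

# Hypothetical analyzer output for a book-search interface
probes = generate_probe_queries({
    "subject": ["computers", "history", "art"],
    "format": ["hardcover", "paperback"],
})
```

Taking the first few values per attribute is the crudest possible spreading strategy; a real generator would pick values so the sampled records cover the value space evenly, as the text requires.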
The accuracy of a query is influenced by the fact that query interfaces often differ in query capability. For example, in the book topic, consider a query on the integrated interface, “title=java and price<20$”. A Web database can answer the query accurately only if it contains both attributes. If it contains only one of the attributes, “title” or “price”, then the results returned from the Web database will contain many data records that may not satisfy the query. The challenging task is therefore how to make the returned results satisfying (for example, returning the minimal superset or the maximal subset of the query).
Because of the rapid increase in their usage, it is imperative to integrate these Web databases so that people can access them easily and obtain information automatically. After proposing a comprehensive solution for Web database integration, each of its components, which is also a research issue in this area, has been summarized.
The issues currently being focused on, and those to be addressed later, have then been presented. In sum, the aim of my undertaking is to build a Web database integration system and to address several open issues in this area.
References
- Bing Liu, Grossman, R. and Yanhong Zhai, (2004). Mining Web Pages for Data Records. IEEE Intelligent Systems, 19(6), pp. 49-55.
- BrightPlanet, (2015). Discontinuing CompletePlanet. [online] Available at: http://www.completeplanet.com/. [Accessed 9 Apr. 2015].
- Chang, K., He, B., Li, C., Patel, M. and Zhang, Z. (2004). Structured databases on the web. ACM SIGMOD Record, 33(3), p.61.
- Clausnitzer, A., Vogel, P. and Wiesener, S. (1995). A WWW interface to the OMNIS/Myriad literature retrieval engine. Computer Networks and ISDN Systems, 27(6), pp.1017-1026.
- De Bra, P. and Post, R. (1994). Computer Networks and ISDN Systems, 27(2), pp.183-192.