Saturday, 28 February 2015

Choose the Best Data Mining Company With This Simple Rule

Data mining is the analysis step of knowledge discovery in databases. It involves finding patterns in large data sets using methods drawn from artificial intelligence, machine learning, statistics, and database systems. The main reason companies do data mining is to transform a large set of data into understandable blocks of information that can be used for market knowledge, allowing them to make informed business decisions.

Data mining was looked upon as a luxury until some time back, but businesses are waking up to the importance of the process as they see the difference it makes. Most multinational corporations already have mining integrated as one of their core processes. Many companies don't make strategic decisions unless they have converted the complete data into useful information using mining techniques. However, it is not a cheap process and must be put to good use in order to justify its cost. This creates demand for a data mining company that can fulfill the client's needs while being resourceful and economical at the same time.

Searching for the perfect data mining company for your business becomes a lot easier if you follow one simple rule. The rule is to make sure that the strategic decisions resulting from a single session of mining the data generate enough profit to at least break even, so that the cost you put into the whole process is justified. Then choose the company whose quotation lets you maximize your profits and improve your business processes even further.

Most companies are not very stringent with their plans and pricing and are happy to go the extra mile to help the client. That extra mile could include offering a discount on the whole process, adding services, or extending the time period within the same package and price as quoted. The way you negotiate with the company will decide the profit that you make from the entire data mining process.

Data mining will not only improve your business decisions, it will improve your business processes as a whole. Used correctly, it allows you to extract more out of limited resources and gives you comprehensive, real-time market knowledge that keeps you ahead of your competitors. Putting in a few extra bucks to integrate it into your core business processes is therefore a good idea. As mentioned earlier, used correctly it will not only justify its own cost but also increase profits manifold.

Choose the right company, integrate the whole process into your business, and make the most of the market knowledge that is present on the internet. The power to make the best and most informed decisions lies in your own hands, and data mining is one approach that will certainly get you a lot closer to your business goals.

Source:http://ezinearticles.com/?Choose-the-Best-Data-Mining-Company-With-This-Simple-Rule&id=8784911

Basics of Online Web Research, Web Mining & Data Extraction Services

The evolution of the World Wide Web and search engines has put an abundant and ever-growing pile of data and information at our fingertips. The web has become a popular and important resource for information research and analysis.

Today, web research services are becoming more and more complicated. They involve various factors, such as business intelligence and web interaction, to deliver the desired results.

Web researchers can retrieve web data by using search engines (keyword queries) or by browsing specific web resources. However, neither method is very effective on its own: keyword search returns a large chunk of irrelevant data, and because each web page contains several outbound links, it is also difficult to extract data by browsing.

Web mining is classified into web content mining, web usage mining and web structure mining. Content mining focuses on the search and retrieval of information from the web. Usage mining extracts and analyzes user behavior. Structure mining deals with the structure of hyperlinks.

Web mining services can be divided into three subtasks:

Information Retrieval (IR): The purpose of this subtask is to automatically find all relevant information and filter out the irrelevant. It uses search engines such as Google, Yahoo and MSN, among other resources, to find the required information.

Generalization: The goal of this subtask is to explore users' interests using data extraction methods such as clustering and association rules. Since web data are dynamic and often inaccurate, it is difficult to apply traditional data mining techniques directly to the raw data.

Data Validation (DV): This subtask tries to uncover knowledge from the data provided by the former tasks. Researchers can test various models, simulate them and finally validate the given web information for consistency. A minimal sketch of how the three subtasks fit together appears below.
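To make these subtasks concrete, here is a minimal, hypothetical Python sketch. The URLs are placeholders rather than real resources, retrieval uses the requests library, the generalization step is a simple TF-IDF clustering with scikit-learn, and the validation step is just a printed consistency check; none of these specific choices come from the article.

    import requests
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    # 1. Information Retrieval: fetch a set of (placeholder) web resources.
    urls = ["https://example.com/a", "https://example.com/b",
            "https://example.com/c", "https://example.com/d"]
    pages = []
    for url in urls:
        response = requests.get(url, timeout=10)
        if response.ok:                       # drop failed fetches
            pages.append((url, response.text))

    # 2. Generalization: group similar documents with a simple clustering model.
    texts = [text for _, text in pages]
    vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

    # 3. Data Validation: inspect the mined grouping for consistency.
    for (url, _), label in zip(pages, labels):
        print(f"{url} -> cluster {label}")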

Should you have any queries regarding Web research or Data mining applications, please feel free to contact us. We would be pleased to answer each of your queries in detail.

Source:http://ezinearticles.com/?Basics-of-Online-Web-Research,-Web-Mining-and-Data-Extraction-Services&id=4511101

Thursday, 26 February 2015

Web Data Extraction Services

Web data extraction from dynamic pages is one of the services that may be acquired through outsourcing. It is possible to siphon information from established websites through the use of data scraping software, and the resulting information is applicable in many areas of business. Solutions such as data collection, screen scraping, email extraction and web data mining services, among others, are available from companies such as Scrappingexpert.com.

Data mining is common in the outsourcing business. Many companies outsource data mining services, and providers of these services can earn a lot of money, especially given the growth of outsourcing and internet business in general. With web data extraction, you can pull data into a structured, organized format even when the source of the information is unstructured or semi-structured.

In addition, it is possible to pull data originally presented in a variety of formats, including PDF, HTML and plain text, among others. A web data extraction service therefore supports a diverse range of information sources. Large-scale organizations that receive large amounts of data on a daily basis have long used data extraction services. The information can be obtained with high accuracy, efficiently and affordably.

Web data extraction services are important for collecting data and web-based information on the internet, and data collection services matter particularly for consumer research, which is becoming vital to companies today. Companies need to adopt strategies that deliver fast and efficient data extraction, organized output formats and flexibility.

People also prefer software that is flexible in its application. Software that can be customized to the needs of customers plays an important role in fulfilling diverse requirements, so companies selling such software need to provide features that deliver an excellent customer experience.

Companies can also extract email addresses and other contact details from various sources, as long as they are valid email addresses, and do so without incurring duplicates. Addresses can be extracted from a variety of web page formats, including HTML files, text files and others. Because these services can be carried out quickly, reliably and with optimal output, software providing such capability is in high demand; it helps businesses quickly build lists of contacts to whom email messages will be sent. A minimal sketch of this kind of extraction appears below.
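As a rough illustration of that kind of extraction, the following Python sketch pulls email addresses out of a mixed set of HTML and text files and removes duplicates. The file names and the regular expression are assumptions made for the example, not details of the service described above.

    import re
    from pathlib import Path

    # A simple (deliberately loose) pattern for email addresses.
    EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

    def extract_emails(paths):
        """Collect unique email addresses from HTML and text files."""
        found = set()
        for path in paths:
            text = Path(path).read_text(encoding="utf-8", errors="ignore")
            found.update(EMAIL_RE.findall(text))   # the set removes duplicates
        return sorted(found)

    if __name__ == "__main__":
        # Hypothetical input files; replace with real sources.
        print(extract_emails(["contacts.html", "leads.txt"]))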

It is also possible to use software to sort large amounts of data and extract information, an activity termed data mining. In this way a company can reduce costs, save time and increase return on investment. In this practice the company may also carry out metadata extraction, data scanning and related tasks.

Source: http://ezinearticles.com/?Web-Data-Extraction-Services&id=4733722

Achieving Sustainability in Mining

There's so much that our planet gives us for our consumption. These things come in different shapes and sizes, and some of the most abundant of them are minerals. Minerals are essential for living in these modern times, and when it comes to extracting them, mining is still the primary method used.

One of the biggest issues that any industry faces is sustainability, and the mining sector is certainly no exception to it. Some of the things that constrain sustainability in this industry are the ever-increasing demand for minerals, the consumption of resources that are needed to extract and process metals, as well as the pollution caused by the process of extracting them.

Increasing Demand for Minerals

There's no question that there's growth in the extraction of construction minerals. As more and more countries become more industrialized, the demand for such minerals is almost directly proportional to the growth in the construction industry. In the 20th century, we saw a growth in the extraction of construction materials. Demand for ores and industrial minerals also increased.

Impacts

Aside from the obvious impact mining has on the environment, it can also have a negative social impact. To keep up with the demand for mined resources, mining activities increase accordingly, and in the course of conducting them certain things can be overlooked, including the short-, medium- and even long-term effects of mining on the community where it is done. This is where the need arises to balance the economic benefits of mining against its potential harmful effects on the environment.

Sustainability and Maximizing Mining Benefits

There are ways to maximize the benefits we can get from mining as we improve sustainability both on the environmental and social fronts. This was specifically addressed in the Plan of Implementation of the World Summit on Sustainable Development. It identified three priority areas:

a. Support efforts to address the environmental, economic, health and social impacts and benefits of mining, minerals and metals throughout their life cycle;

b. Enhance the participation of stakeholders, including local and indigenous communities and women, to play an active role in minerals, metals and mining development throughout the life cycles of mining operations; and

c. Foster sustainable mining practices through the provision of financial, technical and capacity-building support to developing countries and countries with economies in transition for the mining and processing of minerals.

As long as efforts are made for mining to be environmentally, economically, and socially sustainable, we can enjoy the many benefits of mining without worrying about and suffering the potentially harmful effects mining can have on people and nature.

Source: http://ezinearticles.com/?Achieving-Sustainability-in-Mining&id=8108499

Tuesday, 24 February 2015

Data Mining and Financial Data Analysis

Introduction:

Most marketers understand the value of collecting financial data, but also realize the challenges of leveraging this knowledge to create intelligent, proactive pathways back to the customer. Data mining - technologies and techniques for recognizing and tracking patterns within data - helps businesses sift through layers of seemingly unrelated data for meaningful relationships, so that they can anticipate, rather than simply react to, customer needs as well as financial needs. In this accessible introduction, we provide a business and technological overview of data mining and outline how, along with sound business processes and complementary technologies, data mining can reinforce and redefine financial analysis.

Objective:

1. The main objective is to discuss how customized data mining tools should be developed for financial data analysis.

2. Usage patterns can be categorized, in terms of purpose, according to the needs of financial analysis.

3. Develop a tool for financial analysis through data mining techniques.

Data mining:

Data mining is the procedure of extracting or mining knowledge from large quantities of data; in other words, data mining is "knowledge mining from data", also known as Knowledge Discovery in Databases (KDD). In practice, data mining spans data collection, database creation, data management, data analysis and understanding.

The process of knowledge discovery in databases involves the following steps (a minimal code sketch of the pipeline follows the list):

1. Data cleaning. (To remove noise and inconsistent data.)

2. Data integration. (Where multiple data sources may be combined.)

3. Data selection. (Where data relevant to the analysis task are retrieved from the database.)

4. Data transformation. (Where data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations, for instance)

5. Data mining. (An essential process where intelligent methods are applied in order to extract data patterns.)

6. Pattern evaluation. (To identify the truly interesting patterns representing knowledge, based on interestingness measures.)

7. Knowledge presentation.(Where visualization and knowledge representation techniques are used to present the mined knowledge to the user.)
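The following Python sketch walks those steps through on a tiny, made-up set of transactions. The column names, the two "sources" and the spending threshold are illustrative assumptions, not part of the article.

    import pandas as pd

    # Raw data from two hypothetical sources.
    branch_a = pd.DataFrame({"customer": ["c1", "c2", "c2"],
                             "amount":   [120.0, None, 80.0]})
    branch_b = pd.DataFrame({"customer": ["c3", "c1"],
                             "amount":   [200.0, 40.0]})

    # 1-2. Data cleaning and integration: drop noisy rows, combine the sources.
    data = pd.concat([branch_a, branch_b]).dropna()

    # 3-4. Data selection and transformation: aggregate spending per customer.
    per_customer = data.groupby("customer")["amount"].sum()

    # 5. Data mining: a trivial "pattern" - customers spending above a threshold.
    high_value = per_customer[per_customer > 100]

    # 6-7. Pattern evaluation and knowledge presentation.
    print("High-value customers:")
    print(high_value)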

Data Warehouse:

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

Text:

Most banks and financial institutions offer a wide variety of banking services, such as checking and savings accounts, business and individual customer transactions, credit, and investment services like mutual funds. Some also offer insurance and stock investment services.

There are different types of analysis available, but here we focus on one known as "evolution analysis".

Data evolution analysis is used for objects whose behavior changes over time. Although this may include characterization, discrimination, association, classification or clustering of time-related data, evolution analysis is essentially carried out through time-series data analysis, sequence or periodicity pattern matching, and similarity-based data analysis.

Data collected from the banking and financial sectors are often relatively complete, reliable and of high quality, which facilitates analysis and data mining. Here we discuss a few cases.

E.g. 1: Suppose we have stock market data for the last few years available and would like to invest in shares of the best companies. A data mining study of stock exchange data may identify stock evolution regularities for the market overall and for the stocks of particular companies. Such regularities may help predict future trends in stock prices, contributing to our decision making regarding stock investments. A small sketch of this kind of time-series summary follows.
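As a rough illustration of spotting a simple regularity in time-series data, the sketch below computes a moving average and flags the days on which a hypothetical price series rises above it. The prices and the three-day window are invented for the example.

    import pandas as pd

    # Hypothetical daily closing prices for one stock.
    prices = pd.Series([101, 103, 102, 105, 108, 107, 111, 110, 114, 118],
                       index=pd.date_range("2015-01-01", periods=10))

    # A short moving average as a crude trend indicator.
    trend = prices.rolling(window=3).mean()

    # "Regularity": days where the price moves above its recent trend.
    above = prices > trend
    print(pd.DataFrame({"price": prices, "trend": trend, "above_trend": above}))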

E.g. 2: One may wish to view changes in debt and revenue by month, by region and by other factors, along with minimum, maximum, total, average and other statistics. Data warehouses support this kind of comparative analysis, and outlier analysis also plays an important role in financial data analysis and mining.

E.g. 3: Loan payment prediction and customer credit analysis are critical to a bank's business. Many factors can strongly influence loan payment performance and customer credit rating, and data mining may help identify the important factors and eliminate irrelevant ones.

Factors related to loan payment risk include the term of the loan, debt ratio, payment-to-income ratio, credit history and many more. The bank then decides whose profile shows relatively low risk according to the critical-factor analysis. A minimal classification sketch along these lines appears below.
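To make the loan example concrete, here is a minimal, hypothetical sketch that fits a simple classifier on the kinds of factors just mentioned. The numbers, feature set and model choice are invented for illustration and are not from the article.

    from sklearn.tree import DecisionTreeClassifier

    # Each row: [loan term (months), debt ratio, payment-to-income ratio]
    applicants = [[36, 0.20, 0.15],
                  [60, 0.55, 0.40],
                  [24, 0.10, 0.10],
                  [48, 0.65, 0.50],
                  [36, 0.30, 0.25],
                  [60, 0.70, 0.55]]
    defaulted = [0, 1, 0, 1, 0, 1]   # 1 = missed payments in the past

    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    model.fit(applicants, defaulted)

    # Score a new applicant's profile for relative risk.
    new_applicant = [[48, 0.25, 0.20]]
    print("Predicted default risk:", model.predict_proba(new_applicant)[0][1])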

We can perform the task faster and create a more sophisticated presentation with financial analysis software. These products condense complex data analyses into easy-to-understand graphic presentations. And there's a bonus: such software can vault our practice to a more advanced business consulting level and help us attract new clients.

To help us find a program that best fits our needs and our budget, we examined some of the leading packages that represent, by vendors' estimates, more than 90% of the market. Although all the packages are marketed as financial analysis software, they don't all perform every function needed for full-spectrum analyses. The right package should allow us to provide a unique service to clients.

The Products:

ACCPAC CFO (Comprehensive Financial Optimizer) is designed for small and medium-size enterprises and can help make business-planning decisions by modeling the impact of various options. This is accomplished by demonstrating the what-if outcomes of small changes. A roll forward feature prepares budgets or forecast reports in minutes. The program also generates a financial scorecard of key financial information and indicators.

Customized Financial Analysis by BizBench provides financial benchmarking to determine how a company compares to others in its industry by using the Risk Management Association (RMA) database. It also highlights key ratios that need improvement and year-to-year trend analysis. A unique function, Back Calculation, calculates the profit targets or the appropriate asset base to support existing sales and profitability. Its DuPont Model Analysis demonstrates how each ratio affects return on equity.

Financial Analysis CS reviews and compares a client's financial position with business peers or industry standards. It also can compare multiple locations of a single business to determine which are most profitable. Users who subscribe to the RMA option can integrate with Financial Analysis CS, which then lets them provide aggregated financial indicators of peers or industry standards, showing clients how their businesses compare.

iLumen regularly collects a client's financial information to provide ongoing analysis. It also provides benchmarking information, comparing the client's financial performance with industry peers. The system is Web-based and can monitor a client's performance on a monthly, quarterly and annual basis. The network can upload a trial balance file directly from any accounting software program and provide charts, graphs and ratios that demonstrate a company's performance for the period. Analysis tools are viewed through customized dashboards.

PlanGuru by New Horizon Technologies can generate client-ready integrated balance sheets, income statements and cash-flow statements. The program includes tools for analyzing data, making projections, forecasting and budgeting. It also supports multiple resulting scenarios. The system can calculate up to 21 financial ratios as well as the breakeven point. PlanGuru uses a spreadsheet-style interface and wizards that guide users through data entry. It can import from Excel, QuickBooks, Peachtree and plain text files. It comes in professional and consultant editions. An add-on, called the Business Analyzer, calculates benchmarks.

ProfitCents by Sageworks is Web-based, so it requires no installed software or updates. It integrates with QuickBooks, CCH, Caseware, Creative Solutions and Best Software applications. It also provides a wide variety of business analyses for nonprofits and sole proprietorships. The company offers free consulting, training and customer support. It's also available in Spanish.

Source: http://ezinearticles.com/?Data-Mining-and-Financial-Data-Analysis&id=2752017

Metallurgist Roles in Mining Companies

Mining of metals and minerals is a growth industry, especially in Africa, providing job opportunities for metallurgists to work in various roles. Positions are well paid, as metallurgists are required to have at least one degree from an accredited university or college. The preferred qualifications are a Bachelor's Degree in Extractive Metallurgy or Metallurgical Engineering, or a BSc in Chemical Engineering with a major in Mineral Processing. This is not a profession where candidates can learn the required skills on the job, although experience can be gained throughout their career by expanding their exposure to different types of work on mines.

The type of work they do

The most common metallurgist roles include project management, consulting, technical or site management and research. For example, on a mine he/she would be expected to:

•    Design work programs and manage all metallurgical testing both in-house and with external service providers and laboratories

•    Work with the senior team to review and evaluate technical solutions

•    Liaise with geologists and other technical personnel to ensure the most suitable metallurgical solution is understood and employed

•    Constantly re-evaluate the metallurgical performance

At middle management level, as a project manager they would coordinate day-to-day mining activities, manage quality assurance and generally ensure a smooth operation. Mining companies look for a minimum of 5 years' experience before they post these types of managers to remotely located mines. At the most senior level metallurgists can become mine managers, a role that includes coordinating all operations, staffing, running the site itself, selecting the extraction process, and resolving operational and business issues.

Furthering a career

Metallurgists with further education and extensive experience in many technical processes become professional consultants or researchers either working directly for a large mining company or for a consulting firm contracted to it. Their role may be to advise clients on process engineering, to perform cost analyses or do budgeting. They may get involved in environmental impact assessments, HSEQ and social responsibility as well. The mining industry is constantly updating its methods of extraction and waste management in order to stay profitable and needs researchers to continue to explore new methods and processes. Pay levels vary depending on work experience, area of expertise and the location where they are posted.

Some of the personal attributes required to be successful in this field are being an effective team player, having a high level of interpersonal communication skills and being able to express yourself in writing. A good knowledge of French is often asked for when an African posting is offered. Because of the inhospitable locations and remoteness of mines, most of the postings attract single people or more mature staff who do not have school-going children.

Source: http://ezinearticles.com/?Metallurgist-Roles-in-Mining-Companies&id=6678129

Friday, 20 February 2015

Data Mining vs Screen-Scraping

Data mining isn't screen-scraping. I know that some people in the room may disagree with that statement, but they're actually two almost completely different concepts.

In a nutshell, you might state it this way: screen-scraping allows you to get information, whereas data mining allows you to analyze information. That's a pretty big simplification, so I'll elaborate a bit.

The term "screen-scraping" comes from the old mainframe terminal days where people worked on computers with green and black screens containing only text. Screen-scraping was used to extract characters from the screens so that they could be analyzed. Fast-forwarding to the web world of today, screen-scraping now most commonly refers to extracting information from web sites. That is, computer programs can "crawl" or "spider" through web sites, pulling out data. People often do this to build things like comparison shopping engines, archive web pages, or simply download text to a spreadsheet so that it can be filtered and analyzed.

Data mining, on the other hand, is defined by Wikipedia as the "practice of automatically searching large stores of data for patterns." In other words, you already have the data, and you're now analyzing it to learn useful things about it. Data mining often involves lots of complex algorithms based on statistical methods. It has nothing to do with how you got the data in the first place. In data mining you only care about analyzing what's already there.
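To illustrate the distinction in code, the sketch below keeps the two concepts separate: a scraping function that pulls numbers out of a web page's HTML, and a mining step that analyzes whatever data has already been collected. The URL, the CSS class and the statistics used are assumptions made for this example.

    import statistics
    import requests
    from bs4 import BeautifulSoup

    def scrape_prices(url):
        """Screen-scraping: getting the information off a web page."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        return [float(tag.get_text().strip("$"))
                for tag in soup.find_all("span", class_="price")]

    def mine_prices(prices):
        """Data mining (very loosely): analyzing data you already have."""
        return {"mean": statistics.mean(prices),
                "spread": statistics.pstdev(prices)}

    if __name__ == "__main__":
        prices = scrape_prices("https://example.com/products")   # placeholder URL
        print(mine_prices(prices))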

The difficulty is that people who don't know the term "screen-scraping" will try Googling for anything that resembles it. We include a number of these terms on our web site to help such folks; for example, we created pages entitled Text Data Mining, Automated Data Collection, Web Site Data Extraction, and even Web Site Ripper (I suppose "scraping" is sort of like "ripping"). So it presents a bit of a problem: we don't necessarily want to perpetuate a misconception (i.e., screen-scraping = data mining), but we also have to use terminology that people will actually use.

Source:http://ezinearticles.com/?Data-Mining-vs-Screen-Scraping&id=146813

Wednesday, 18 February 2015

There is No Need to Disrupt the Schedule to Keep the Kitchen Canopy and Extraction System Clean

After taking over a large and beautiful stately hotel, its new owner quickly realised that the kitchen extract system would not be straightforward to maintain: the ductwork was somewhat ancient and would therefore be difficult to clean.

A prestige hotel needs to maintain a high level of hygiene as well as to minimise the risk of a kitchen fire.

So, if replacing the entire system is not an option, what can the new owner do to find a solution that meets exacting standards of cleanliness, minimises the risk of a fire starting in the system, and ensures that the cleaning does not disrupt the operation of the hotel and restaurant as a business?

The first step is to use an experienced specialist commercial cleaning service to assess the establishment, the types of food cooked, and how, and at what level of intensity, the cooking is done.

Without this information it is difficult to advise on how maintenance should be carried out.

The frequency of the cleaning cycle for a canopy and its components depends not only on the regularity and duration of cooking below but also on the type of cooking and the ingredients being used.

Where the kitchen use is light, canopies and extract systems may only need a 12-month cycle for maintenance and cleaning. However, in a busy hotel, kitchen activity is most likely to be heavy and the cleaning company may advise a three- or four-month cycle.

Grease filters and canopies over the cookers should ideally be designed, sized and constructed to be robust enough for regular washing in a commercial dishwasher, which is the most thorough and efficient method of cleaning them yourself.

It's important to make sure when re-installing filters that they are fitted the right way around, with any framework drain holes at the lowest, front edge. Of course, grease filters are covered with a coating of grease and can therefore be slippery and difficult to handle. Appropriate protective gloves should be worn when handling them.

The canopies and their component parts should be designed to be easy to clean, but if they are not, provided the cleaning intervals are fairly frequent, regular washing with soap or mild detergent and warm water, followed by a clean water rinse might be adequate. If too long a period is left between cleans, grease will become baked-on and require special attention.

No grease filtration is 100% efficient and therefore a certain amount of grease passes through the filters to be deposited on the internal surfaces of the filter housings and ductwork.

Left unattended, this layer of grease on the non-visible surfaces of the canopy creates both hygiene and fire risks.

Deciding on when cleaning should take place, and how often, is something an experienced specialist cleaning company can help with. The simplest guide is that if a surface or component looks dirty, then it needs cleaning.

Most important, however, is regular inspection of all surfaces and especially non-visible ones. The maintenance schedule for any kitchen installation should include inspections.

Copyright (c) 2010 Alison Withers

A regular maintenance and cleaning schedule is not impossible even in the kitchen of a hotel with an antiquated canopy and duct system with the help of a specialist commercial cleaning company to advise on how to do it without disrupting the work flow, as writer Ali Withers discovers.

Source: http://ezinearticles.com/?There-is-No-Need-to-Disrupt-the-Schedule-to-Keep-the-Kitchen-Canopy-and-Extraction-System-Clean&id=4877266

Coal Seam Gas - Extraction and Processing

With rapidly depleting natural resources, people around the globe are looking for new sources of energy. Lots of people don't think much of it, but doing so is an excellent ecological move forward and may even be a lucrative endeavour. Australia has one of the most significant deposits of a recently discovered gas known as coal seam gas. The deposit present in areas such as New South Wales is far more significant than the others since it contains little methane and much more carbon dioxide.

What is coal seam gas?

Coal bed methane is the more general term for this substance. It is a form of natural gas taken from substantial coal beds. The existence of this material usually spelled hazard for many sites, but this changed in recent decades, when specialists discovered its potential as an energy source. It's now among the most important sources of energy in a number of countries, particularly in North America. Extraction within Australia is developing rapidly because of rich deposits in various parts of the country.

Extraction

The extraction procedure is reasonably challenging. It calls for heavy drilling, water pumping, and tubing. Though there are a variety of different processes, pipeline construction (an initial step) is perhaps one of the most important. The foundation of the course of action can spell the difference between the failure or success of the undertaking.

Working with a Contractor

Pipeline construction and design is serious business. Seasoned contractors may be hard to get considering the fact that Australia's coal seam gas industry is still fairly young. You'll find only a limited number of completed and working projects across the country. There are several things to consider when getting a contractor for the project.

Find one with substantial experience within the industry sector. Some service providers have operations outside the country, especially in Canada and America. This is something you should look out for, as development of the gas originated there. Providers with completed projects in that area can have the solutions required for any project to take off.

The construction process involves several basic steps. It is important the service provider you work with addresses all of your needs. Below are a few of the important supplementary services to look for.

- Pipeline design, production, and installation

- Custom ploughing (to achieve specialized trenching requirements)

- Protection and repair of pipelines with the use of various liners

- Pressure assessment and commissioning

These are only the fundamentals of pipeline construction. Sourcing coal seam gas involves many others. Do thorough research to ensure the service provider you employ is capable of completing all the necessary tasks. Other elements of the undertaking include engineering plus site preparation and rehabilitation. This industrial sector may be profitable if one makes all of the proper moves.

Avoid making uninformed decisions by doing as much research as you possibly can. Use the web to your advantage to look into a company's profile. Look for a portfolio of the projects they have completed in the past. You can gauge their trustworthiness based on their volume of clients. Check out the scope of their operations and the projects they finished.

You should also think about company policies concerning the quality of their work, safety and health, along with their policies concerning communities and the environment. These are seemingly minute but important details when searching for a contractor for pipeline construction projects.

Source: http://ezinearticles.com/?Coal-Seam-Gas---Extraction-and-Processing&id=6954936

Monday, 16 February 2015

Why Common Measures Taken To Prevent Scraping Aren't Effective

Bots became more powerful in 2014. As the war continues, let’s take a closer look at why common strategies to prevent scraping didn’t pay off.

With the market for online businesses expanding rapidly, the development teams behind these online portals are under great amounts of pressure to keep up in the race. Scalability, availability and responsiveness are some of the commonly faced problems for a growing online business portal. As the value of content is increasing, content theft has become an increasing problem in the form of web scraping.

Competitors have learned to stay ahead of the race by using bots to scrape. While how these bots could be harmful is something worth talking about, it is not the main scope of this article. This article discusses some of the commonly used weapons to fight bots and brings to light their effectiveness in reality.

We come across many developers who claim to have taken measures to prevent their sites from being scraped. A common belief is that the techniques listed below significantly reduce scraping activity on a website. While some of these methods could actually work in concept, we were interested to explore how effective they were in practice.

Most Commonly used techniques to Prevent Scraping:

•    Setting up robots.txt – Surprisingly, this technique is used against malicious bots! Why this wouldn't work is pretty straightforward: robots.txt is an agreement between websites and search engine bots that keeps search engine bots away from sensitive information. No malicious bot (or the scraper behind it) in its right mind would obey robots.txt. This is the most ineffective method of preventing scraping.

•    Filtering requests by user agent – The user agent string of a client is set by the client itself, and one method is to obtain it from the HTTP header of a request. This way, a request can be filtered even before the content is served. We observed that very few bots (fewer than 10%) used a default user agent string that belonged to a scraping tool or was empty. Once their requests to the website were filtered based on the user agent, it didn't take long for scrapers to realize this and change their user agent to that of a well-known browser. This method merely stops new bots written by inexperienced scrapers for a few hours (a minimal example of such a filter, and the trivial counter-move, is sketched after this list).

•    Blacklisting the IP address – Turning to an IP blacklisting service is much easier than the hectic process of capturing more metrics from page requests and analyzing server logs. There are plenty of third-party services which maintain a database of blacklisted IPs. In our hunt for a suitable blacklisting service, we found that using a third-party DNSBL/RBL service was not effective, as these services blacklisted only email spambot servers and did little to prevent scraping bots. Less than 2% of scraping bots were detected for one of our customers when we did a trial run.

•    Throwing CAPTCHA – A very well-known practice to stop bots is to throw a CAPTCHA on pages with sensitive content. Although effective against bots, the CAPTCHA is thrown at all clients requesting the web page, irrespective of whether they are human or bot. This method often antagonizes users and hence reduces traffic to the website. Some more insights into the new No CAPTCHA reCAPTCHA by Google can be found in our previous blog post.

•    Honey pot or honey trap – Honey pots are a brilliant trap mechanism to capture new bots (scrapers who are not well versed with the structure of every page) on the website. But this approach poses a lesser-known threat of reducing the page rank on search engines. Here's why: search engine bots visit these links and might get trapped accidentally. Even if exceptions were made by disallowing a set of known user agents, the links to the traps might still be indexed by a search engine bot. These links are interpreted as dead, irrelevant or fake links by search engines, and with more such traps, the ranking of the website decreases considerably. Furthermore, filtering requests based on user agent can be exploited, as discussed above. In short, honey pots are a risky business which must be handled very carefully.
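As a rough illustration of why the user-agent filter above is weak, here is a hypothetical sketch: a server-side check that rejects known scraping tools, followed (in the comment) by the one-line change a scraper needs to get past it. The framework, blocklist and header values are assumptions for the example, not details from this post.

    from flask import Flask, request, abort

    app = Flask(__name__)
    SCRAPER_AGENTS = ("python-requests", "curl", "wget")   # naive blocklist

    @app.route("/listings")
    def listings():
        agent = request.headers.get("User-Agent", "").lower()
        if not agent or agent.startswith(SCRAPER_AGENTS):
            abort(403)              # filter the request before serving content
        return "sensitive catalogue data"

    # ...and the scraper's trivial counter-move: send a browser-like header.
    #   import requests
    #   requests.get("http://example.com/listings",
    #                headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0)"})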

To summarize, the prevention strategies listed above are either weak or require constant monitoring and regular maintenance to remain effective. In practice, bots are far more challenging to stop than they might seem.

What to expect in 2015?

With the increasing demand for scraping, the number of scraping tools and expert scrapers is also increasing, which simply means bots are going to be a growing problem. In fact, the usage of headless browsers (browser-like bots used to scrape) is increasing, and scrapers are no longer relying on wget, curl and HTML parsers. Preventing malicious bots from stealing content without disturbing the genuine traffic from humans and search engine bots is just going to get harder. By the end of the year, we could infer from our database that almost half of an average website's traffic is caused by bots, and a whopping 30-40% by malicious bots. We believe this is only going to increase if we do not step up and take action!

p.s. If you think you are facing similar problems, why not request more information? Also, if you do not have the time or bandwidth to take such actions, scraping prevention and stopping malicious bots is something we provide as a service. How about a free trial?

Source:http://www.shieldsquare.com/why-common-measures-taken-to-prevent-scraping-arent-effective/

Thursday, 12 February 2015

I Don’t Need No Stinking API: Web Scraping For Fun and Profit

If you’ve ever needed to pull data from a third party website, chances are you started by checking to see if they had an official API. But did you know that there’s a source of structured data that virtually every website on the internet supports automatically, by default?
That's right, we're talking about pulling our data straight out of HTML — otherwise known as web scraping. Here's why web scraping is awesome:

Any content that can be viewed on a webpage can be scraped. Period.

If a website provides a way for a visitor’s browser to download content and render that content in a structured way, then almost by definition, that content can be accessed programmatically. In this article, I’ll show you how.

Over the past few years, I’ve scraped dozens of websites — from music blogs and fashion retailers to the USPTO and undocumented JSON endpoints I found by inspecting network traffic in my browser.

There are some tricks that site owners will use to thwart this type of access — which we’ll dive into later — but they almost all have simple work-arounds.

Why You Should Scrape

But first we’ll start with some great reasons why you should consider web scraping first, before you start looking for APIs or RSS feeds or other, more traditional forms of structured data.

Websites are More Important Than APIs

The biggest one is that site owners generally care way more about maintaining their public-facing visitor website than they do about their structured data feeds.

We’ve seen it very publicly with Twitter clamping down on their developer ecosystem, and I’ve seen it multiple times in my projects where APIs change or feeds move without warning.

Sometimes it’s deliberate, but most of the time these sorts of problems happen because no one at the organization really cares or maintains the structured data. If it goes offline or gets horribly mangled, no one really notices.

Whereas if the website goes down or is having issues, that's more of an in-your-face, drop-everything-until-this-is-fixed kind of problem, and gets dealt with quickly.

No Rate-Limiting

Another thing to think about is that the concept of rate-limiting is virtually non-existent for public websites.

Aside from the occasional captchas on sign up pages, most businesses generally don’t build a lot of defenses against automated access. I’ve scraped a single site for over 4 hours at a time and not seen any issues.

Unless you’re making concurrent requests, you probably won’t be viewed as a DDOS attack, you’ll just show up as a super-avid visitor in the logs, in case anyone’s looking.

Anonymous Access

There are also fewer ways for the website's administrators to track your behavior, which can be useful if you want to gather data more privately.

With APIs, you often have to register to get a key and then send along that key with every request. But with simple HTTP requests, you’re basically anonymous besides your IP address and cookies, which can be easily spoofed.

The Data’s Already in Your Face

Web scraping is also universally available, as I mentioned earlier. You don’t have to wait for a site to open up an API or even contact anyone at the organization. Just spend some time browsing the site until you find the data you need and figure out some basic access patterns — which we’ll talk about next.

Let’s Get to Scraping

So you’ve decided you want to dive in and start grabbing data like a true hacker. Awesome.

Just like reading API docs, it takes a bit of work up front to figure out how the data is structured and how you can access it. Unlike APIs however, there’s really no documentation so you have to be a little clever about it.

I’ll share some of the tips I’ve learned along the way.

Fetching the Data

So the first thing you’re going to need to do is fetch the data. You’ll need to start by finding your “endpoints” — the URL or URLs that return the data you need.

If you know you need your information organized in a certain way — or only need a specific subset of it — you can browse through the site using their navigation. Pay attention to the URLs and how they change as you click between sections and drill down into sub-sections.

The other option for getting started is to go straight to the site's search functionality. Try typing in a few different terms and again, pay attention to the URL and how it changes depending on what you search for. You'll probably see a GET parameter like q= that always changes based on your search term.

Try removing other unnecessary GET parameters from the URL, until you’re left with only the ones you need to load your data. Make sure that there’s always a beginning ? to start the query string and a & between each key/value pair.
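Here is what that kind of trimmed-down endpoint might look like, sketched in Python with the Requests library the post recommends later. The site, path and parameter names are placeholders standing in for whatever you find in the real URL.

    import requests

    BASE_URL = "https://example.com/search"      # hypothetical endpoint

    def search(term, page=1):
        # Only the parameters actually needed to load the data.
        params = {"q": term, "page": page}
        response = requests.get(BASE_URL, params=params, timeout=10)
        response.raise_for_status()
        return response.text

    html = search("blue widgets")
    print(len(html), "bytes of HTML fetched")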

Dealing with Pagination

At this point, you should be starting to see the data you want access to, but there’s usually some sort of pagination issue keeping you from seeing all of it at once. Most regular APIs do this as well, to keep single requests from slamming the database.

Usually, clicking to page 2 adds some sort of offset= parameter to the URL, which is usually either the page number or else the number of items displayed on the page. Try changing this to some really high number and see what response you get when you “fall off the end” of the data.

With this information, you can now iterate over every page of results, incrementing the offset parameter as necessary, until you hit that “end of data” condition.
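One way that loop might look, again as a sketch: the offset= parameter, the wrapper class and the "no more results" condition are all assumptions about a hypothetical site.

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://example.com/search"      # hypothetical endpoint

    def fetch_all_items(term):
        items, offset = [], 0
        while True:
            response = requests.get(BASE_URL,
                                    params={"q": term, "offset": offset},
                                    timeout=10)
            soup = BeautifulSoup(response.text, "html.parser")
            page_items = soup.find_all("div", class_="result")   # assumed wrapper
            if not page_items:            # we've "fallen off the end" of the data
                break
            items.extend(page_items)
            offset += len(page_items)     # advance by the number of items seen
        return items

    print(len(fetch_all_items("blue widgets")), "results collected")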

The other thing you can try doing is changing the “Display X Per Page” which most pagination UIs now have. Again, look for a new GET parameter to be appended to the URL which indicates how many items are on the page.

Try setting this to some arbitrarily large number to see if the server will return all the information you need in a single request. Sometimes there’ll be some limits enforced server-side that you can’t get around by tampering with this, but it’s still worth a shot since it can cut down on the number of pages you must paginate through to get all the data you need.

AJAX Isn’t That Bad!

Sometimes people see web pages with URL fragments # and AJAX content loading and think a site can’t be scraped. On the contrary! If a site is using AJAX to load the data, that probably makes it even easier to pull the information you need.

The AJAX response is probably coming back in some nicely-structured way (probably JSON!) in order to be rendered on the page with Javascript.

All you have to do is pull up the network tab in Web Inspector or Firebug and look through the XHR requests for the ones that seem to be pulling in your data.

Once you find it, you can leave the crufty HTML behind and focus instead on this endpoint, which is essentially an undocumented API.
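Once you've spotted such an XHR request in the network tab, pulling from it directly might look like this; the endpoint path, parameters and JSON keys are invented for the sketch.

    import requests

    # Hypothetical JSON endpoint discovered via the browser's network tab.
    API_URL = "https://example.com/api/listings"

    response = requests.get(API_URL,
                            params={"category": "shoes", "page": 1},
                            timeout=10)
    data = response.json()          # already structured - no HTML parsing needed

    for item in data.get("results", []):
        print(item.get("name"), item.get("price"))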

(Un)structured Data?

Now that you’ve figured out how to get the data you need from the server, the somewhat tricky part is getting the data you need out of the page’s markup.

Use CSS Hooks

In my experience, this is usually straightforward since most web designers litter the markup with tons of classes and ids to provide hooks for their CSS.

You can piggyback on these to jump to the parts of the markup that contain the data you need.

Just right click on a section of information you need and pull up the Web Inspector or Firebug to look at it. Zoom up and down through the DOM tree until you find the outermost <div> around the item you want.

This <div> should be the outer wrapper around a single item you want access to. It probably has some class attribute which you can use to easily pull out all of the other wrapper elements on the page. You can then iterate over these just as you would iterate over the items returned by an API response.
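In BeautifulSoup (which the post recommends below), piggybacking on those CSS hooks might look like the following; the class names and fields are assumptions about a hypothetical product listing page.

    import requests
    from bs4 import BeautifulSoup

    html = requests.get("https://example.com/products", timeout=10).text
    soup = BeautifulSoup(html, "html.parser")

    # The assumed outer wrapper around each item, found via the inspector.
    for wrapper in soup.find_all("div", class_="product-card"):
        name = wrapper.find("h2", class_="product-name")
        price = wrapper.find("span", class_="price")
        if name and price:
            print(name.get_text(strip=True), price.get_text(strip=True))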

A note here though: the DOM tree that is presented by the inspector isn't always the same as the DOM tree represented by the HTML sent back by the website. It's possible that the DOM you see in the inspector has been modified by Javascript — or sometimes even by the browser, if it's in quirks mode.

Once you find the right node in the DOM tree, you should always view the source of the page (“right click” > “View Source”) to make sure the elements you need are actually showing up in the raw HTML.

This issue has caused me a number of head-scratchers.

Get a Good HTML Parsing Library

It is probably a horrible idea to try parsing the HTML of the page as a long string (although there are times I’ve needed to fall back on that). Spend some time doing research for a good HTML parsing library in your language of choice.

Most of the code I write is in Python, and I love BeautifulSoup for its error handling and super-simple API. I also love its motto:

    You didn’t write that awful page. You’re just trying to get some data out of it. Beautiful Soup is here to help. :)

You’re going to have a bad time if you try to use an XML parser since most websites out there don’t actually validate as properly formed XML (sorry XHTML!) and will give you a ton of errors.

A good library will read in the HTML that you pull in using some HTTP library (hat tip to the Requests library if you’re writing Python) and turn it into an object that you can traverse and iterate over to your heart’s content, similar to a JSON object.

Some Traps To Know About

I should mention that some websites explicitly prohibit the use of automated scraping, so it’s a good idea to read your target site’s Terms of Use to see if you’re going to make anyone upset by scraping.

For two-thirds of the websites I've scraped, the above steps are all you need. Just fire off a request to your "endpoint" and parse the returned data.

But sometimes, you’ll find that the response you get when scraping isn’t what you saw when you visited the site yourself.

When In Doubt, Spoof Headers

Some websites require that your User Agent string is set to something they allow, or you need to set certain cookies or other headers in order to get a proper response.

Depending on the HTTP library you’re using to make requests, this is usually pretty straightforward. I just browse the site in my web browser and then grab all of the headers that my browser is automatically sending. Then I put those in a dictionary and send them along with my request.
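With the Requests library, that header-spoofing step is just a dictionary passed along with the call. The header values below are a trimmed-down, hypothetical copy of what a browser might send, not anything prescribed by the post.

    import requests

    # Headers copied (and trimmed) from a real browser session.
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
        "Accept": "text/html,application/xhtml+xml",
        "Accept-Language": "en-US,en;q=0.9",
        "Referer": "https://example.com/",      # some sites check this too
    }

    response = requests.get("https://example.com/data", headers=headers, timeout=10)
    print(response.status_code)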

Note that this might mean grabbing some login or other session cookie, which might identify you and make your scraping less anonymous. It’s up to you how serious of a risk that is.

Content Behind A Login

Sometimes you might need to create an account and login to access the information you need. If you have a good HTTP library that handles logins and automatically sending session cookies (did I mention how awesome Requests is?), then you just need your scraper login before it gets to work.
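With Requests, that usually means using a Session object so the login cookies are sent automatically on every later request. The login URL and form field names below are placeholders for whatever the target site actually uses.

    import requests

    with requests.Session() as session:
        # Log in once; the session keeps whatever cookies come back.
        session.post("https://example.com/login",
                     data={"username": "me@example.com", "password": "hunter2"},
                     timeout=10)

        # Subsequent requests go out as the logged-in user.
        page = session.get("https://example.com/members/reports", timeout=10)
        print(page.status_code, len(page.text))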

Note that this obviously makes you totally non-anonymous to the third party website so all of your scraping behavior is probably pretty easy to trace back to you if anyone on their side cared to look.

Rate Limiting

I’ve never actually run into this issue myself, although I did have to plan for it one time. I was using a web service that had a strict rate limit that I knew I’d exceed fairly quickly.

Since the third party service conducted rate-limiting based on IP address (stated in their docs), my solution was to put the code that hit their service into some client-side Javascript, and then send the results back to my server from each of the clients.

This way, the requests would appear to come from thousands of different places, since each client would presumably have their own unique IP address, and none of them would individually be going over the rate limit.

Depending on your application, this could work for you.

Poorly Formed Markup

Sadly, this is the one condition that there really is no cure for. If the markup doesn’t come close to validating, then the site is not only keeping you out, but also serving a degraded browsing experience to all of their visitors.

It’s worth digging into your HTML parsing library to see if there’s any setting for error tolerance. Sometimes this can help.

If not, you can always try falling back on treating the entire HTML document as a long string and do all of your parsing as string splitting or — God forbid — a giant regex.



Well there’s 2000 words to get you started on web scraping. Hopefully I’ve convinced you that it’s actually a legitimate way of collecting data.

It’s a real hacker challenge to read through some HTML soup and look for patterns and structure in the markup in order to pull out the data you need. It usually doesn’t take much longer than reading some API docs and getting up to speed with a client. Plus it’s way more fun!

Source: https://blog.hartleybrody.com/web-scraping/